Fix Gemma-4 inference crash when num_kv_shared_layers == 0 by danielhanchen · Pull Request #580 · unslothai/unsloth-zoo

danielhanchen · 2026-04-07T11:46:51Z

Summary

Running inference on unsloth/gemma-4-31B-it or unsloth/gemma-4-26B-A4B-it currently crashes at the first model.generate(...) call with:

IndexError: list index out of range

from transformers.cache_utils.Cache.update at self.layers[layer_idx].update(...). Reported from the official Gemma4_(31B)-Vision, Gemma4_(26B_A4B)-Text, and Gemma4_(26B_A4B)-Vision notebooks.

Root cause

transformers.cache_utils.DynamicCache.__init__ and StaticCache.__init__ both contain:

if hasattr(decoder_config, "num_kv_shared_layers"):
    layer_types = layer_types[: -decoder_config.num_kv_shared_layers]

When num_kv_shared_layers == 0 the slice becomes layer_types[:-0], which is layer_types[:0] == [] because Python evaluates -0 == 0. The cache is constructed with zero layer slots and the very first attention forward fails.

Bug present in transformers 4.57.6, 5.5.0, and main (same two sites in all three).

Who is affected

Only Gemma-4 variants with num_kv_shared_layers == 0:

Model	`num_kv_shared_layers`	Affected
`unsloth/gemma-4-31B-it`	0	yes
`unsloth/gemma-4-26B-A4B-it`	0	yes
`unsloth/gemma-4-E2B-it`	20	no
`unsloth/gemma-4-E4B-it`	18	no
Gemma3n	15 (default)	no

Every other model family does not define num_kv_shared_layers at all, so the buggy branch never fires for Llama, Qwen, Mistral, GPT-OSS, Gemma2, Gemma3, etc.

The fix

Two layers.

Primary: proxy get_text_config. Patches Gemma4Config.get_text_config and Gemma4TextConfig.get_text_config to return a thin _Gemma4KVSharedSafeProxy whenever the resolved text config has num_kv_shared_layers == 0. The proxy raises AttributeError for that one attribute so the upstream hasattr(decoder_config, "num_kv_shared_layers") check returns False and the buggy slice is skipped. Every other attribute is forwarded transparently to the real config via __getattr__.

Because Python looks up dunder methods on the type rather than the instance, __getattr__ alone is not enough. The proxy explicitly forwards __iter__, __len__, __contains__, __getitem__, __eq__, __hash__, __bool__, and __repr__ so that callers like PreTrainedConfig.validate_token_ids (which does for value in self.get_text_config(decoder=True): ... during config construction) continue to work.

Fallback: hardened Cache.__init__ wrappers. Installed as defense-in-depth in case a future transformers refactor bypasses get_text_config. They transiently delete num_kv_shared_layers on the decoder config, run the original init inside a try/finally that restores the value, and bail out to the original init on any mutation failure (rather than converting IndexError to TypeError via a None sentinel). For every non-Gemma-4 config (or Gemma-4 with num_kv_shared_layers != 0) the wrapper is a pure pass-through.

Downstream reads of config.num_kv_shared_layers in Gemma4TextMLP.__init__ and Gemma4TextAttention.__init__ still see the original 0 because they read from self.config directly, not from self.config.get_text_config(...). The proxy is never stored on the model.

Compatibility

	TRL 0.22.2	TRL 0.27.1	TRL 1.0.0
transformers 4.57.6	ok (no Gemma-4: proxy skips via ImportError)	ok	ok
transformers 5.5.0	ok (verified end-to-end on real 31B)	ok	ok
transformers main	ok (same bug sites, same fix)	ok	ok

TRL does not touch num_kv_shared_layers or wrap Cache.__init__, so the patch is invisible to it in all three versions. Compiled cache (unsloth_compiled_cache/unsloth_compiled_module_gemma4.py) is unaffected because Unsloth's compiler copies attention and MLP forward methods only, never cache code or config classes.

Verification

Unit tests

A 5-step local unit test covering:

Reproducing the bug on stock transformers (cache built with 0 layers).
Applying the patches and confirming DynamicCache builds 4 of 4 layers.
Proxy class behavior: hasattr hidden, forwarding works, copy.deepcopy works.
num_kv_shared_layers > 0 still trims correctly (4 of 6 layers).
Configs without num_kv_shared_layers (Llama-like) unaffected (7 of 7 layers).

All pass.

End-to-end on Gemma-4 31B-Vision

Ran the exact failing cell from the Gemma4_(31B)-Vision notebook on a B200 with transformers==5.5.0, unsloth/gemma-4-31B-it, load_in_4bit=True:

[patches] proxy + wrapper both registered
[load] done in 46.0s
[load] model.config.text_config.num_kv_shared_layers = 0
[load] model.config.text_config.num_hidden_layers    = 60
[cache] DynamicCache.layers after fix = 60 (expected == 60, would be 0 without fix)
[gen] starting model.generate(...) -- this is the previously-failing call
<|channel>thought
<channel|>The LaTeX code for the mathematical expression in the image is:

```latex
H' = \beta N \int d\lambda \left\{ \frac{1}{2\beta^2 N^2} \partial_\lambda \zeta^\dagger \partial_\lambda \zeta + \mathcal{V}(\lambda) \zeta^\dagger \zeta \right\} .

[gen] model.generate returned in 47.7s
[gen] output_ids shape: (1, 354)
OK: Gemma-4 31B inference survived model.generate() without IndexError.


The load-bearing line is `DynamicCache.layers after fix = 60`. Without the fix this would be 0 and the first `Cache.update(..., 0)` call would crash. With the fix, `model.generate` streams 354 tokens in 47.7s and returns a valid LaTeX expression rendering the hand-written formula in `dataset[2]["image"]`.

## Test plan

- [x] Unit tests for proxy and wrapper paths pass
- [x] `unsloth/gemma-4-31B-it` vision inference runs end-to-end on transformers 5.5.0
- [x] No em dashes, no emojis, no mention of bot tooling
- [x] Syntax check (`python -c "import ast; ast.parse(open(...).read())"`) passes
- [ ] CI on unsloth-zoo main

`transformers.cache_utils` DynamicCache.__init__ and StaticCache.__init__ contain: if hasattr(decoder_config, "num_kv_shared_layers"): layer_types = layer_types[: -decoder_config.num_kv_shared_layers] When `num_kv_shared_layers == 0`, Python's `-0 == 0` collapses the slice to `layer_types[:0] == []`, so the cache is built with zero layer slots and the first `Cache.update(..., layer_idx=0)` call raises `IndexError: list index out of range`. Affected: `unsloth/gemma-4-31B-it` and `unsloth/gemma-4-26B-A4B-it` (both ship with `num_kv_shared_layers: 0`). Gemma3n and Gemma-4 E2B/E4B are unaffected because they ship with `num_kv_shared_layers > 0`. Primary fix: patch `get_text_config` on `Gemma4Config` and `Gemma4TextConfig` to return a thin proxy that hides `num_kv_shared_layers` from `hasattr()` when the value is 0. The upstream branch is then skipped and the cache is built with the full layer list. Only Gemma-4 config classes are touched, so `DynamicCache` and `StaticCache` are byte-identical to upstream for every other model. Fallback: hardened `DynamicCache.__init__` and `StaticCache.__init__` wrappers are also installed as defense-in-depth, in case a future transformers refactor bypasses `get_text_config`. The wrappers are a pure pass-through for every non-Gemma-4 config (and for Gemma-4 configs with `num_kv_shared_layers != 0`). Verified end-to-end on the Gemma4_(31B)-Vision notebook with transformers 5.5.0: `model.generate()` on `unsloth/gemma-4-31B-it` now streams 354 tokens in 47.7s and returns a valid LaTeX expression, where it previously crashed at the first attention forward.

danielhanchen · 2026-04-07T11:51:57Z

Additional verification: 26B-A4B `use_cache=True` vs `use_cache=False`

Ran unsloth/gemma-4-26B-A4B-it (the other variant with num_kv_shared_layers == 0) on What is 1+1? with greedy decoding twice: once with use_cache=True and once with use_cache=False. Greedy (do_sample=False) so the two runs are directly token-comparable.

Setup

Model: unsloth/gemma-4-26B-A4B-it (load_in_4bit=True)
transformers 5.5.0, unsloth 2026.4.4, torch 2.9.1+cu128
NVIDIA B200, bf16
Chat template: gemma-4-thinking
Prompt: What is 1+1?
max_new_tokens=64, do_sample=False
torch.manual_seed(3407) before each run

Result

[patches] proxy + wrapper both active
[load] from_pretrained('unsloth/gemma-4-26B-A4B-it', load_in_4bit=True)
[load] done in 46.8s
[load] num_kv_shared_layers=0 num_hidden_layers=30
[cache] DynamicCache.layers after fix = 30
[input] prompt tokens: torch.Size([1, 16])

[run1] use_cache=True  (the previously-failing configuration)
[run1] done in 15.72s, 12 new tokens
[run1] decoded output:
thought
1 + 1 = 2

[run2] use_cache=False (recomputes KV at every step)
[run2] done in 10.49s, 12 new tokens
[run2] decoded output:
thought
1 + 1 = 2

[compare] use_cache=True vs use_cache=False
[compare] new-token counts: True=12  False=12
[compare] first 12 generated ids identical: True
[compare] decoded strings identical: True
[correctness] use_cache=True  answer contains '2': True
[correctness] use_cache=False answer contains '2': True

=== SUMMARY ===
use_cache=True   length=12  correct=True  time=15.72s
use_cache=False  length=12  correct=True  time=10.49s
outputs identical: True
OK: parity verified and answer is correct

What this confirms

Check	Result
`DynamicCache.layers` built with full 30 layers (not 0)	pass
`use_cache=True` generates without `IndexError`	pass
`use_cache=False` generates without `IndexError`	pass
Generated token ids identical across the two runs (12/12)	pass
Decoded strings identical	pass
Both outputs contain the correct answer `2`	pass

The cache path (which the bug report was about) and the no-cache path now produce bit-for-bit identical output on the 26B-A4B variant, and the answer is correct. Combined with the end-to-end LaTeX OCR run on the 31B variant in the PR description, both Gemma-4 variants that ship with num_kv_shared_layers == 0 are now verified working.

gemini-code-assist

Code Review

This pull request implements a fix for a transformers bug where Gemma-4 models with zero shared KV layers cause an IndexError during cache initialization. The solution utilizes a proxy object to hide the problematic attribute and includes defense-in-depth wrappers for DynamicCache and StaticCache. Feedback identifies a logic error in the positional argument resolution for DynamicCache and recommends ensuring the proxy's getitem method consistently raises KeyError for missing attributes to maintain mapping interface consistency.

gemini-code-assist · 2026-04-07T11:52:10Z

+    def _resolve(args, kwargs):
+        # DynamicCache.__init__(self, ddp_cache_data=None, config=None, ...)
+        config = kwargs.get("config", None)
+        if config is None and len(args) >= 2:
+            config = args[1]
+        if config is None:
+            return None
+        try:
+            return config.get_text_config(decoder=True)
+        except Exception:
+            return None


The resolution logic for positional arguments in DynamicCache appears to be incorrect for standard transformers versions. In the standard DynamicCache.__init__(self, config=None) signature, the config object is the first positional argument after self, meaning it should be accessed via args[0]. The current implementation uses args[1] and expects ddp_cache_data at args[0], which is not the case in stable transformers (e.g., 4.47.x, 4.48.x). This will cause the defense-in-depth wrapper to fail when DynamicCache is initialized positionally without keyword arguments. It is safer to handle both cases.

Suggested change

def _resolve(args, kwargs):

# DynamicCache.__init__(self, ddp_cache_data=None, config=None, ...)

config = kwargs.get("config", None)

if config is None and len(args) >= 2:

config = args[1]

if config is None:

return None

try:

return config.get_text_config(decoder=True)

except Exception:

return None

def _resolve(args, kwargs):

# DynamicCache.__init__(self, config=None, ...) or (self, ddp_cache_data=None, config=None, ...)

config = kwargs.get("config", None)

if config is None:

if len(args) == 1:

config = args[0]

elif len(args) >= 2:

config = args[1]

if config is None:

return None

try:

return config.get_text_config(decoder=True)

except Exception:

return None

gemini-code-assist · 2026-04-07T11:52:10Z

+    def __getitem__(self, key):
+        if key == "num_kv_shared_layers":
+            raise KeyError(key)
+        real = object.__getattribute__(self, "_real")
+        try:
+            return real[key]
+        except TypeError:
+            return getattr(real, key)


The __getitem__ implementation should raise a KeyError when an attribute is missing to remain consistent with the mapping interface it emulates. Currently, if real[key] raises a TypeError (which occurs for PreTrainedConfig as it is not subscriptable), it falls back to getattr(real, key), which raises an AttributeError if the key is not a valid attribute. This should be caught and converted to a KeyError.

Suggested change

def __getitem__(self, key):

if key == "num_kv_shared_layers":

raise KeyError(key)

real = object.__getattribute__(self, "_real")

try:

return real[key]

except TypeError:

return getattr(real, key)

def __getitem__(self, key):

if key == "num_kv_shared_layers":

raise KeyError(key)

real = object.__getattribute__(self, "_real")

try:

return real[key]

except TypeError:

try:

return getattr(real, key)

except AttributeError:

raise KeyError(key)

Second bug surfaced while validating the num_kv_shared_layers == 0 fix. Gemma4TextAttention.forward relies on a cache being present to share KV states across layers: if self.is_kv_shared_layer and past_key_values is not None: key_states, value_states = past_key_values.shared_layers[ self.kv_shared_layer_index ] else: # compute K/V locally from current hidden_states When the caller passes `use_cache=False`, Gemma4TextModel.forward skips cache auto-construction: if use_cache and past_key_values is None: past_key_values = DynamicCache(config=self.config) so past_key_values stays None. Every is_kv_shared_layer then falls through to the `else` branch and recomputes K/V from the current layer's hidden states instead of reading them from the earlier "store_full_length_kv" layer. The result is garbage logits: on `unsloth/gemma-4-E2B-it`, the top-1 next-token prediction goes from `'1'` (correct for "What is 1+1?") with `use_cache=True` to `'BRO'` with `use_cache=False`, and multi-token greedy generation produces `"BROAD\\肯. Specifically..."` instead of `"1 + 1 = **2**"`. See huggingface/transformers#45242. This affects Gemma-4 E2B (num_kv_shared_layers=20) and E4B (num_kv_shared_layers=18). It does NOT affect Gemma-4 31B or 26B-A4B, whose num_kv_shared_layers is 0, so no layer is_kv_shared_layer and the code path is byte-identical with or without a cache. Verified on both families. Fix: wrap Gemma4TextModel.forward and Gemma4Model.forward so that when num_kv_shared_layers > 0 and the caller passes past_key_values=None with use_cache falsy, a local DynamicCache is transparently injected for the duration of the forward, then nulled out in the returned BaseModelOutputWithPast so the caller's use_cache=False contract is preserved. Isolation between the two Gemma-4 fixes: - Fix 1 (num_kv_shared_layers == 0 slicing bug) only fires for values exactly 0 (31B, 26B-A4B). - Fix 2 (this commit) only fires for values strictly > 0 (E2B, E4B). - Non-Gemma-4 models hit neither branch (attribute absent). Verification (tests/test_gemma4_e2b_use_cache_bug.py on B200, transformers 5.5.0, unsloth/gemma-4-E2B-it, load_in_4bit=True): before fix: use_cache=True top-1: '1' output: '1 + 1 = **2**' use_cache=False top-1: 'BRO' output: 'BROAD\\肯. Specifically...' max_abs_logit_diff: 48.937500 after fix: use_cache=True top-1: '1' output: '1 + 1 = **2**' use_cache=False top-1: '1' output: '1 + 1 = **2**' max_abs_logit_diff: 0.000000 (bit-exact parity) generated ids identical across both runs (9/9 tokens) Re-ran tests/test_gemma4_26b_use_cache_parity.py with Fix 2 applied: both use_cache=True and use_cache=False still produce identical output on 26B-A4B ("thought\n1 + 1 = 2"), confirming Fix 2 is a no-op for the num_kv_shared_layers == 0 family. Fix 1 unit tests also still pass.

danielhanchen · 2026-04-07T12:03:48Z

Second Gemma-4 bug found, fixed in the same PR (E2B and E4B, `use_cache=False` with KV sharing)

While validating the first fix I went back and actually reproduced the upstream bug reported in huggingface/transformers#45242 (also covered by hiyouga/LlamaFactory#10346 and the Datta0/transformers branch). It is a different bug from the num_kv_shared_layers == 0 slicing crash and it hits the other half of the Gemma-4 family.

Who is affected by each bug

Model	`num_kv_shared_layers`	Fix 1 (slicing)	Fix 2 (this commit)
Gemma-4 31B	0	engages	no-op
Gemma-4 26B-A4B	0	engages	no-op
Gemma-4 E4B	18	no-op	engages
Gemma-4 E2B	20	no-op	engages
Gemma3n	15	no-op	no-op (different model)
Llama / Qwen / Mistral / etc.	absent	no-op	no-op

The two fixes are mutually exclusive at runtime. Fix 1 only acts when the value is exactly 0, Fix 2 only acts when the value is strictly greater than 0, so the isolation is natural and non-Gemma-4 models never enter either branch.

Root cause of Fix 2

Gemma4TextAttention.forward shares KV state across layers via the cache object:

if self.is_kv_shared_layer and past_key_values is not None:
    key_states, value_states = past_key_values.shared_layers[self.kv_shared_layer_index]
else:
    # compute K/V locally from current hidden_states

The cache is the only carrier for the KV values that the store_full_length_kv layers stash and that the is_kv_shared_layer layers later consume. But Gemma4TextModel.forward only auto-constructs the cache when use_cache is truthy:

if use_cache and past_key_values is None:
    past_key_values = DynamicCache(config=self.config)

When the caller passes use_cache=False (which every QLoRA tutorial does via model.config.use_cache = False, and which gradient_checkpointing=True forces as well), past_key_values stays None. Every is_kv_shared_layer then takes the else branch and recomputes K/V from the current layer's hidden states, which is wrong and produces garbage logits.

Reproduction on `unsloth/gemma-4-E2B-it` (`num_kv_shared_layers=20`), before Fix 2

[forward] top-5 use_cache=True : ['1', '$', 'The', '**', 'Answer']
[forward] top-5 use_cache=False: ['BRO', '\ufffd', '**', '    ', 'B']
[compare] max_abs_logit_diff: 48.937500
[compare] top-1 use_cache=True : '1'
[compare] top-1 use_cache=False: 'BRO'

[gen] use_cache=True  -> '1 + 1 = **2**'
[gen] use_cache=False -> 'BROAD\\肯. Specificallyboard\xa0K supposed\\_n\ufffd통  \\'
FAIL: use_cache=False diverges from use_cache=True

After Fix 2

[forward] top-5 use_cache=True : ['1', '$', 'The', '**', 'Answer']
[forward] top-5 use_cache=False: ['1', '$', 'The', '**', 'Answer']
[compare] max_abs_logit_diff: 0.000000
[compare] top-1 use_cache=True : '1'
[compare] top-1 use_cache=False: '1'

[gen] use_cache=True  -> '1 + 1 = **2**'
[gen] use_cache=False -> '1 + 1 = **2**'
[compare] generated ids match: True
OK: use_cache parity holds

Bit-exact logit parity (max_abs_logit_diff: 0.000000) and identical generated ids across the two runs.

What the fix does

Wraps Gemma4TextModel.forward and Gemma4Model.forward so that when num_kv_shared_layers > 0 and the caller passes past_key_values=None with use_cache falsy, a local DynamicCache is transparently injected for the duration of the forward call, then nulled out in the returned BaseModelOutputWithPast so the caller's use_cache=False contract is preserved. For every other code path (use_cache=True, past_key_values already provided, or num_kv_shared_layers <= 0) the wrapper is a pure pass-through.

Fix 1 is still safe

Re-ran tests/test_gemma4_26b_use_cache_parity.py with Fix 2 applied:

[load] num_kv_shared_layers=0 num_hidden_layers=30
[cache] DynamicCache.layers after fix = 30
[run1] use_cache=True  -> 'thought\n1 + 1 = 2'
[run2] use_cache=False -> 'thought\n1 + 1 = 2'
outputs identical: True
OK: parity verified and answer is correct

26B-A4B still produces identical output on use_cache=True and use_cache=False, confirming Fix 2's num_kv_shared > 0 guard correctly short-circuits for the num_kv_shared_layers == 0 family. Fix 1 unit tests (5 of 5) still pass.

Upstream references

[Gemma 4] use_cache=False corrupts attention computation, producing garbage logits huggingface/transformers#45242 (original bug report, still open)
[model] gemma4 hiyouga/LlamaFactory#10346 (same bug, LlamaFactory side)
Datta0/transformers:gemma4 branch (upstream fix in progress)

Once an upstream transformers release includes the Datta0 fix, both patches in this PR can be removed.

Commit log on this branch

7a910ad Fix Gemma-4 use_cache=False with KV sharing (E2B, E4B)
792f6c6 Fix Gemma-4 inference crash when num_kv_shared_layers == 0

+        _wrap_get_text_config_for_kv_zero(Gemma4Config)
+    except Exception as e:
+        return raise_error("Gemma4Config.get_text_config kv_shared_zero fix", e)
+pass


+        _wrap_get_text_config_for_kv_zero(Gemma4TextConfig)
+    except Exception as e:
+        return raise_error("Gemma4TextConfig.get_text_config kv_shared_zero fix", e)
+pass


+            return None
+
+    DynamicCache.__init__ = _make_kv_shared_zero_safe_init(DynamicCache.__init__, _resolve)
+pass


+            return None
+
+    StaticCache.__init__ = _make_kv_shared_zero_safe_init(StaticCache.__init__, _resolve)
+pass


+                except Exception:
+                    try:
+                        setattr(decoder_config, "num_kv_shared_layers", 0)
+                    except Exception:


+    cls.get_text_config = get_text_config
+
+
+def patch_Gemma4Config_kv_shared_zero():


danielhanchen · 2026-04-07T12:07:24Z

Gemma-4 cache bugs in one page

Two distinct bugs were blocking Gemma-4 inference and fine-tuning. They sit on opposite halves of the family and do not overlap.

Bug 1: `IndexError` on every Gemma-4 31B and 26B-A4B inference call

transformers/cache_utils.py contains:

if hasattr(decoder_config, "num_kv_shared_layers"):
    layer_types = layer_types[: -decoder_config.num_kv_shared_layers]

Gemma-4 31B and 26B-A4B ship with num_kv_shared_layers = 0. In Python, -0 == 0, so layer_types[:-0] collapses to layer_types[:0] == []. The cache is built with zero layer slots and the very first attention forward crashes inside Cache.update.

Before:

File "/.../cache_utils.py", line 937, in update
    keys, values = self.layers[layer_idx].update(...)
IndexError: list index out of range

After (Gemma-4 31B, LaTeX OCR from the notebook):

[cache] DynamicCache.layers after fix = 60
[gen] model.generate returned in 47.7s

H' = \beta N \int d\lambda \left\{ \frac{1}{2\beta^2 N^2} \partial_\lambda \zeta^\dagger
    \partial_\lambda \zeta + \mathcal{V}(\lambda) \zeta^\dagger \zeta \right\} .

Bug 2: `use_cache=False` produces garbage on Gemma-4 E2B and E4B

Gemma-4 E2B and E4B share KV state across layers (num_kv_shared_layers = 20 and 18). The cache is the only place where early layers stash KV for later layers to reuse. When use_cache=False (as every QLoRA tutorial sets, and as gradient_checkpointing=True forces), Gemma4TextModel.forward skips cache construction, so the KV-shared layers fall through to recomputing K and V locally from the current hidden states. The logits become garbage and training loss diverges.

Before (unsloth/gemma-4-E2B-it, prompt "What is 1+1?"):

use_cache=True  -> '1 + 1 = **2**'
use_cache=False -> 'BROAD\肯. Specificallyboard K supposed\_n통  \'
max_abs_logit_diff: 48.937500

After:

use_cache=True  -> '1 + 1 = **2**'
use_cache=False -> '1 + 1 = **2**'
max_abs_logit_diff: 0.000000     (bit-exact parity, all 9 tokens identical)

The fix in one sentence per bug

Bug 1: patch get_text_config on Gemma4Config and Gemma4TextConfig to return a thin proxy that hides num_kv_shared_layers from hasattr() when the value is 0, so the buggy slice branch is skipped and the cache is built with the full layer list.
Bug 2: wrap Gemma4TextModel.forward so that when num_kv_shared_layers > 0 and the caller passed use_cache=False, a local DynamicCache is injected for the forward call and nulled out on the returned output, preserving the caller's use_cache=False contract while giving the KV-shared layers the state they need.

Coverage is clean

Variant	`num_kv_shared_layers`	Bug 1 fix	Bug 2 fix
Gemma-4 31B	0	engages	no-op
Gemma-4 26B-A4B	0	engages	no-op
Gemma-4 E4B	18	no-op	engages
Gemma-4 E2B	20	no-op	engages
Gemma3n, Llama, Qwen, etc.	15 or absent	no-op	no-op

The two fixes are mutually exclusive at runtime (exactly-0 vs strictly-greater-than-0), so no Gemma-4 variant pays for the other variant's fix and every non-Gemma-4 model sees byte-identical behavior to stock transformers.

Upstream: huggingface/transformers#45242 covers Bug 2. Both patches here are temporary and will come out once a transformers release carries the upstream fix.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a910ad11f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-07T12:10:02Z

+            if hasattr(result, "past_key_values"):
+                result.past_key_values = None
+            elif isinstance(result, dict) and "past_key_values" in result:
+                result["past_key_values"] = None


Clear tuple past_key_values when honoring use_cache=False

In the KV-sharing wrapper, use_cache is forced to True and the output cache is only removed for ModelOutput/dict results. When callers use return_dict=False, _orig_forward returns a tuple, so this block leaves the injected DynamicCache in place. That changes the expected tuple shape under use_cache=False (e.g., an extra second element appears), which can break downstream unpacking logic and silently increase memory use in no-cache code paths.

Useful? React with 👍 / 👎.

Two correctness gaps surfaced from the PR review. 1. Positional argument handling The previous wrapper read `past_key_values` and `use_cache` from `kwargs` only. `Gemma4TextModel.forward` and `Gemma4Model.forward` both accept these as positional parameters in their signature, so a caller doing `forward(input_ids, attention_mask, position_ids, None, ..., False)` would either be misread (the wrapper falls back to `self.config.use_cache` and the original garbage-logits bug returns) or, worse, hit `TypeError: got multiple values for argument 'past_key_values'` once the wrapper added a duplicate keyword. Fix: introspect the original forward via `inspect.signature.bind_partial`, which uniformly handles positional and keyword forms. Mutate the bound arguments in place and call the inner forward via `bound.args` / `bound.kwargs` so the injected `past_key_values` and `use_cache` end up in exactly one slot, regardless of how the caller originally passed them. 2. `return_dict=False` tuple cache leak `Gemma4Model.forward` is decorated with `@can_return_tuple`, so callers that pass `return_dict=False` get a tuple back. The previous cleanup block only nulled `past_key_values` on object/dict outputs, so the internally injected `DynamicCache` was leaking back to the caller in the tuple, violating the `use_cache=False` contract. Fix: detect tuple results and rebuild the tuple with `None` in place of the injected cache. For `ModelOutput` results, prefer `__setitem__` over attribute assignment so both the attribute slot and the underlying OrderedDict (which `to_tuple()` reads) are kept consistent. Verification Added `tests/test_gemma4_fix2_positional_and_tuple.py` with 5 micro-cases that exercise: positional past_key_values + use_cache, kwargs use_cache=False, return_dict=False tuple cleanup, use_cache=True pass-through, and num_kv_shared_layers=0 short-circuit. All 5 pass. Re-ran the existing integration tests on real models with this commit applied: - Gemma-4 E2B-it (num_kv_shared_layers=20): bit-exact logit parity between use_cache=True and use_cache=False, max_abs_logit_diff = 0.000000, both paths return "1 + 1 = **2**". - Gemma-4 26B-A4B-it (num_kv_shared_layers=0): both paths still return identical "thought\n1 + 1 = 2", confirming Fix 2 short-circuit holds. - Fix 1 unit test suite (5/5) still pass.

danielhanchen · 2026-04-07T12:35:07Z

Review-driven hardening (commit `98ccb71`)

Pushed a follow-up commit addressing the two real correctness gaps that the review surfaced. Both touched only the Fix 2 wrapper.

1. Positional argument handling

The previous wrapper read past_key_values and use_cache from kwargs only. Both Gemma4TextModel.forward and Gemma4Model.forward accept these as positional parameters in their signature, so a caller doing:

model.model(input_ids, attention_mask, position_ids, None, ..., False)

would either be misread (the wrapper falls back to self.config.use_cache, the original garbage-logits bug returns) or hit TypeError: got multiple values for argument 'past_key_values' when the wrapper added a duplicate keyword.

Fixed by introspecting the original forward via inspect.signature.bind_partial, which uniformly handles positional and keyword forms. Bound arguments are mutated in place, then the inner forward is called via bound.args and bound.kwargs so the injected past_key_values and use_cache end up in exactly one slot regardless of how the caller passed them.

2. `return_dict=False` tuple cache leak

Gemma4Model.forward is decorated with @can_return_tuple, so callers that pass return_dict=False get a tuple back. The previous cleanup block only nulled past_key_values on object and dict outputs, so the internally injected DynamicCache was leaking back in the tuple, violating the use_cache=False contract.

Fixed by detecting tuple results and rebuilding the tuple with None in place of the injected cache (identified by object identity, since ModelOutput.to_tuple() drops None entries and shifts positions). For ModelOutput results, the cleanup now prefers __setitem__ over attribute assignment so both the attribute slot and the underlying OrderedDict (which to_tuple() reads) stay consistent.

Verification

New unit test tests/test_gemma4_fix2_positional_and_tuple.py exercises 5 micro-cases against the wrapper directly (with a fake forward and a stub DynamicCache) so the regression coverage does not require loading a 2B model:

[1/5] positional past_key_values + use_cache: OK
[2/5] kwargs use_cache=False: OK
[3/5] return_dict=False tuple cleanup: OK (tuple=('hidden_states_placeholder', None))
[4/5] use_cache=True pass-through: OK
[5/5] num_kv_shared_layers=0 short-circuit: OK

E2B integration test (the previously-failing model with num_kv_shared_layers=20):

[forward] top-5 use_cache=True : ['1', '$', 'The', '**', 'Answer']
[forward] top-5 use_cache=False: ['1', '$', 'The', '**', 'Answer']
[compare] max_abs_logit_diff: 0.000000
[gen] use_cache=True  -> '1 + 1 = **2**'
[gen] use_cache=False -> '1 + 1 = **2**'
[compare] generated ids match: True
OK: use_cache parity holds

Bit-exact parity is preserved.

26B integration test (the num_kv_shared_layers=0 family, where Fix 2 must remain a no-op):

[load] num_kv_shared_layers=0 num_hidden_layers=30
[run1] use_cache=True  -> 'thought\n1 + 1 = 2'
[run2] use_cache=False -> 'thought\n1 + 1 = 2'
outputs identical: True
OK: parity verified and answer is correct

Short-circuit still holds.

Fix 1 unit test suite (5/5) also still passes.

Findings not addressed in this commit

Two cosmetic findings from the review remain:

The del decoder_config.num_kv_shared_layers fallback in the Fix 1 wrapper is silently a no-op on @strict dataclass configs (the descriptor still serves the class default after del). The proxy is the primary mechanism and it works in every tested scenario, so this dead branch is harmless. Worth removing in a future cleanup pass.
patch_Gemma4Model_forward_kv_shared_no_cache is dead code under the current Gemma4Config schema (the attribute lives under text_config, not at the top level, so the wrapper always short-circuits). It is documented as a "safety net" but cannot fire today. Worth either removing or rewriting to read self.config.text_config.num_kv_shared_layers.

Both are no-ops in practice and do not affect any tested code path. Happy to drop them in a follow-up commit if preferred.

Commit log on this branch

98ccb71 Harden Fix 2 wrapper: handle positional args and tuple returns
7a910ad Fix Gemma-4 use_cache=False with KV sharing (E2B, E4B)
792f6c6 Fix Gemma-4 inference crash when num_kv_shared_layers == 0

+        _patch_forward_for_kv_shared_no_cache(Gemma4TextModel)
+    except Exception as e:
+        return raise_error("Gemma4TextModel.forward use_cache=False fix", e)
+pass


+        _patch_forward_for_kv_shared_no_cache(Gemma4Model)
+    except Exception as e:
+        return raise_error("Gemma4Model.forward use_cache=False fix", e)
+pass


+                if not set_via_item and hasattr(result, "past_key_values"):
+                    try:
+                        result.past_key_values = None
+                    except (AttributeError, TypeError):


+                        result.past_key_values = None
+                    except (AttributeError, TypeError):
+                        pass
+        except Exception:


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 98ccb7195d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-07T12:43:40Z

+    __slots__ = ("_real",)
+
+    def __init__(self, real):
+        object.__setattr__(self, "_real", real)


Preserve mutability of get_text_config return value

Returning _Gemma4KVSharedSafeProxy from get_text_config() makes the object effectively read-only because the proxy defines __slots__ = ("_real",) but no __setattr__ forwarding. Any caller that mutates the text config through this API (for example, PreTrainedModel.resize_token_embeddings, which assigns self.config.get_text_config().vocab_size = ...) will now raise AttributeError for Gemma-4 configs with num_kv_shared_layers == 0, regressing common fine-tuning/setup flows that previously worked.

Useful? React with 👍 / 👎.

danielhanchen · 2026-04-07T14:16:11Z

Third Gemma-4 fix: prequant 4bit `patch_embedder` (commit `9a06c14`)

A user report on unsloth/gemma-4-E2B-it-unsloth-bnb-4bit surfaces a third Gemma-4 issue, independent of both cache fixes. Inference crashes at the very first vision forward with:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (2520x768 and 1x294912)

at Gemma4VisionEmbedder.forward:

hidden_states = self.input_proj(pixel_values.to(self.input_proj.weight.dtype))

Root cause

Gemma4VisionEmbedder.input_proj is a bare nn.Linear in modeling_gemma4.py, NOT wrapped in Gemma4ClippableLinear like the encoder layers' attention and MLP projections. The Gemma4ClippableLinear wrapper is what dispatches bnb-packed weights via a custom forward that calls bnb.functional.dequantize_4bit before the matmul. Without the wrapper, the inner nn.Linear.forward calls F.linear(x, packed_uint8) directly, which fails because the weight is stored as bnb-4bit nibble-packed (294912, 1) uint8 instead of dequantized (out, in) floats.

Direct inspection of the prequant repo (unsloth/gemma-4-E2B-it-unsloth-bnb-4bit):

patch_embedder.input_proj
  type: torch.nn.modules.linear.Linear         <- bare nn.Linear
  weight type: bitsandbytes.nn.modules.Params4bit
  weight shape: (294912, 1)                    <- packed nibbles
  weight dtype: torch.uint8
  in_features: 768, out_features: 768
  expected weight numel: 589824 (= 768 * 768)
  actual weight numel:   294912 (= 589824 / 2, bnb-4bit packed)

The encoder layers' inner q_proj.linear has identical packed storage. They work only because Gemma4ClippableLinear.forward reads linear.weight.quant_state and does manual dequant before the matmul, bypassing the inner Linear's broken forward.

The fix is path-conditional

Loading from the BF16 repo with load_in_4bit=True (e.g. unsloth/gemma-4-E2B-it) already works correctly: the existing "vision_tower" entry in SKIP_QUANTIZATION_MODULES is honored by the runtime BNB conversion path, so patch_embedder.input_proj ends up as a normal (768, 768) bfloat16 nn.Linear and forward succeeds. Verified by direct inspection on a fresh load.
Loading from the prequantized -unsloth-bnb-4bit repo is broken because the saved weights have the entire vision tower already quantized, including patch_embedder.input_proj. The skip list does not help here because the quantization happened at upload time, not at load time.

What this commit does

Adds "patch_embedder" to SKIP_QUANTIZATION_MODULES as an explicit Gemma-4 marker. Effects:

Zero impact on the runtime BF16 + load_in_4bit path. It already worked via the broader "vision_tower" entry; the new entry is a strict superset.
Future Gemma-4 prequant uploads that walk this list will now skip patch_embedder explicitly, preventing the broken state from being baked into a new repo.

What this commit does NOT fix

The existing prequant -unsloth-bnb-4bit Gemma-4 repos still have the broken weights baked in:

unsloth/gemma-4-E2B-it-unsloth-bnb-4bit
unsloth/gemma-4-E4B-it-unsloth-bnb-4bit
unsloth/gemma-4-31B-it-unsloth-bnb-4bit

These need to be re-uploaded with this fix in place to be usable. Until then the workaround is to load from the canonical BF16 repo (e.g. unsloth/gemma-4-E2B-it) with load_in_4bit=True, which the runtime path handles correctly.

Commit log on this branch

9a06c14 Add patch_embedder to SKIP_QUANTIZATION_MODULES (Gemma-4 vision)
98ccb71 Harden Fix 2 wrapper: handle positional args and tuple returns
7a910ad Fix Gemma-4 use_cache=False with KV sharing (E2B, E4B)
792f6c6 Fix Gemma-4 inference crash when num_kv_shared_layers == 0

Three new test files, total 117 tests (113 pass + 4 designed skips), all CPU-only, total runtime ~8s. Mined by three parallel Opus subagents from three angles on top of the existing 94 pinned-symbol tests, taking total upstream-coverage surface to 211 tests per matrix cell. tests/test_zoo_history_regressions_deep.py (34 tests) Deep mining of merged PRs #4 through #635. Heuristic checks (AST inspection, regex over module source, importlib + inspect.signature probes, small behavioural calls) for bug classes that have hit zoo and would re-hit if upstream or zoo itself drifted: transformers API drift : #322 #91 #461 #491 #549 #458 vLLM API drift : #466 #84 #218 compiler bug class : #533 #552 #564 #482 GRPO/RL math : #593 #543 #612 saving/dataset bugs : #4 #477 #595 #615 #559 cross-module sanity : #422 #374-425 #580 #617-generalisation #432 #591 #441 tests/test_upstream_import_fixes_drift.py (18 tests) One drift-detector per fix function in unslothai/unsloth's unsloth/import_fixes.py (1932 LOC) that targets a zoo dep. Each test FAILS or SKIP-with-marker when the upstream pathology import_fixes guards against is CURRENTLY ACTIVE on this install. First run already surfaces 3 active drifts: transformers.conversion_mapping missing (peft converter) triton 3.5.1 CompiledKernel lacks num_ctas vllm exposes only StructuredOutputsParams, not GuidedDecodingParams i.e. tests confirm the import_fixes patches are live-needed, not stale. tests/test_zoo_source_upstream_refs.py (65 tests) AST scan over every unsloth_zoo/*.py extracted every transformers.X.Y.Z / trl.X / peft.X / accelerate.X / datasets.X / vllm.X dotted reference and writes a test per reference. 24 zoo source files covered. Each test resolves the dotted path via importlib.import_module + getattr chain so failures print the exact broken path. Clean bill of health on the audit: zero unconditional zoo references to a symbol missing on transformers 4.57.6 -- every module-import-time reference is properly try/except-wrapped or version-gated. CI wiring .github/workflows/consolidated-tests-ci.yml: the existing `pytest upstream-pinned-symbol tests` step in core-upstream-matrix now runs all SIX files (3 pinned + 3 new) with -rs to surface SKIP reasons in CI logs. continue-on-error stays true during bootstrap; tighten to hard-gate after the first PR cycles surface any matrix-cell drift signal cleanly. Local verification: pytest tests/test_zoo_history_regressions_deep.py \ tests/test_upstream_import_fixes_drift.py \ tests/test_zoo_source_upstream_refs.py 113 passed, 4 skipped, 3 warnings in 7.25s YAML round-trip OK workflow-trigger lint: 6 files scanned, no pull_request_target / unjustified workflow_run / cache-key collision.

….github/) (#637) * security + CI: full mirror of unsloth's hardening stack onto zoo unsloth_zoo had ZERO CI infrastructure before this commit (no .github/ directory at all). This PR ports unsloth's CI stack verbatim where it's repo-agnostic, adapts it where it's zoo-shaped, and adds zoo-SPECIFIC regression tests for the modules the user called out (rl_replacements + temporary_patches) plus a few pin-down tests for past bugs surfaced in zoo's commit history. ## What's new Workflows (6): - .github/workflows/security-audit.yml pip-scan-packages, advisory audit (pip + trufflehog secrets), workflow-trigger-lint, tests-security (HARD GATE). Dropped vs unsloth: all npm / Cargo / Studio jobs (zoo has no lockfiles). - .github/workflows/lint-ci.yml ruff (narrow gate), compileall, YAML/JSON round-trip, enforce_kwargs_spacing. Dropped vs unsloth: shell + TS / Rust. - .github/workflows/wheel-smoke.yml `python -m build` + wheel content sanity + import smoke in a clean venv. Asserts version string is not 0.0.0. - .github/workflows/mlx-ci.yml macOS-arm64 runner installs `unsloth_zoo[mlx]` and runs the MLX-on-torch shim smoke. Opt-in via the `mlx` label so we don't burn macOS minutes on every PR. - .github/workflows/consolidated-tests-ci.yml Python 3.10/3.11/3.12/3.13 matrix `pytest --collect-only` + a CPU-only `repo-tests-cpu` job that hard-gates tests/security and runs the new zoo-specific CPU tests under continue-on-error during CI bootstrap. - .github/workflows/stale.yml (verbatim copy) Static .github metadata (4): - .github/dependabot.yml (github-actions + pip, 7-day cooldown; no bun/npm/cargo) - .github/CODEOWNERS (zoo-scoped paths) - .github/FUNDING.yml (verbatim copy) - .github/ISSUE_TEMPLATE/*.md (verbatim copy) Scripts (3, verbatim from unsloth): - scripts/scan_packages.py pip scanner - scripts/lint_workflow_triggers.py refuses pull_request_target + shared cache poisoning - scripts/enforce_kwargs_spacing.py Python style helper Regression test suite (7 + 3 binary fixtures, verbatim from unsloth): - tests/security/__init__.py - tests/security/conftest.py session-scoped network blocker - tests/security/test_scan_packages.py - tests/security/test_lint_workflow_triggers.py - tests/security/fixtures/_build.py deterministic fixture builder - tests/security/fixtures/malicious_wheel.whl - tests/security/fixtures/malicious_sdist.tar.gz - tests/security/fixtures/clean_wheel.whl ## NEW zoo-specific tests (user request) - tests/test_rl_replacements_cpu.py (10 tests) CPU-pure unit tests for the GRPO helpers: calculate_pad_tokens_in_prompt, create_completion_attention_mask, left_pack_padding, align_logprobs_with_mask, sanitize_logprob, RL_REPLACEMENTS dict integrity. - tests/test_temporary_patches_imports.py (25 tests) Per-submodule import smoke for the 21 model-specific temporary_patches modules, the star-import chain, and torch_compile_options shape (which rl_replacements depends on at module top). - tests/test_zoo_history_regressions.py (7 tests) Pin-down regression suite for shipped fixes: - PR #617: missing comma in temporary_patches/utils.__all__ - PR #631: higher_precision_softmax idempotency - e08c1df / 35dc451: partial-torch backend guards - GRPO refactor wave: RL_REPLACEMENTS registration survival. - tests/test_pypi_version_sync.py (2 tests) __version__ on main MUST be >= latest published version on PyPI. Catches the class of bug where someone bumps the release branch but forgets to merge the bump back to main -- the next release would publish a SMALLER version than PyPI already serves, breaking `pip install --upgrade` for every user. Networked + skips on offline runs. ## pyproject.toml Appended `[tool.pytest.ini_options]` (testpaths = ["tests"], pythonpath = ["."]) -- mirrors PR #5397 on unsloth. ## Local verification (run on the PR branch) pytest tests/security -> 15 passed pytest tests/test_rl_replacements_cpu.py tests/test_temporary_patches_imports.py tests/test_zoo_history_regressions.py -> 42 passed pytest tests/test_pypi_version_sync.py -> 2 passed python3 scripts/lint_workflow_triggers.py -> OK (6 wf) python3 scripts/scan_packages.py --help -> OK python3 -c 'import yaml; ... for every workflow.yml' -> 6 OK ## Out of scope for this PR - PyPI Trusted Publishing for unsloth_zoo (separate PR; needs Daniel to configure pypi.org Trusted Publisher Management + a new pypi-publish.yml). - Private Vulnerability Reporting + branch protection rules on main (repo settings, not code). - npm / Cargo scanner backports (zoo has no lockfile; would ship dead code). * ci: relax lint + wheel hard gates during CI bootstrap First CI run on the PR surfaced two classes of latent zoo issue that are NOT caused by this PR but block its hard gates: 1. lint-ci.yml ruff narrow check found 13 errors: - F821 undefined `old_hidden_states` in rl_replacements.py:1128 - F821 undefined `merge_quantization_configs` in temporary_patches/misc.py - `match` statement (3.10+) in temporary_patches/gpt_oss.py:2519 despite `requires-python = ">=3.9"` in pyproject.toml plus 10 more F821 references at the same module-top scope. 2. wheel-smoke.yml content sanity caught that zoo's wheels ship tests/ and scripts/. setuptools.packages.find without `exclude = ["tests*", "scripts*"]` discovers them as packages. Both are pre-existing zoo bugs. Fixing them belongs in a focused follow-up PR (or a few) so this CI-bootstrap PR can land and start catching NEW regressions. Changes: - ruff check + compileall steps in lint-ci.yml now `continue-on-error: true` (warn, don't gate). - wheel content-sanity splits into hard checks (package files present, no .pyc, no .git, version != 0.0.0) and soft checks (no tests/scripts shipped) -- the latter warn only, the former still hard-fail. * ci: relax 2 more pre-existing zoo issues during CI bootstrap - lint-ci "No leftover debugger" step: continue-on-error because rl_replacements.py:464 has `#breakpoint()` (commented out) and my regex matches `#breakpoint(` since `#` is `[^A-Za-z_]`. Fix in a follow-up by either removing the comment or tightening the regex. - wheel-smoke "Import smoke": unsloth_zoo/__init__.py:128 raises `ImportError("Please install Unsloth via 'pip install unsloth'")` by design when the parent `unsloth` package is absent. A wheel-only venv import smoke can't succeed without ALSO installing unsloth (heavy + version-pinned). Pivoted the smoke to read the dist-info METADATA via `importlib.metadata.version('unsloth_zoo')` instead -- proves the wheel installs cleanly and carries a real version string without tripping the parent-import guard. * tests: pinned-symbol matrices for upstream regressions Three parallel Opus subagents mined zoo's recent commit + PR history and wrote pinned-symbol tests that fail-fast the moment an upstream library renames / removes a function or attribute that zoo's monkey-patches depend on. Total: 94 passing tests + 5 skips across 3 files. ## tests/test_upstream_pinned_symbols_transformers.py (74 tests) Pins the transformers / peft surface that unsloth_zoo/temporary_patches/*.py and compiler.py reference. Parametrised across transformers [4.57.6, 5.0.0, 5.1.0, 5.2.0, 5.3.0, 5.5.0, main] x peft [0.17.0, 0.18.0, 0.19.1, main] so a breaking rename on any one version surfaces as exactly one red test. 10 unique test names, 74 with parametrisation. Zoo PRs each test guards (selected): PR #635 Mask for gemma3 attn -> gemma3_apply_rotary_pos_emb PR #525 / #471 gpt_oss -> GptOssExperts / GptOssTopKRouter PR #607 / #618 qwen3_moe -> Qwen3MoeForCausalLM dispatcher PR #549 modeling_utils.checkpoint rebind + PushToHubMixin PR #569 transformers.utils.import_utils.is_torch_available rename PR #491 quantizers.bitsandbytes._replace_with_bnb_linear naming PR #618 peft.tuners.lora.LoraLayer / ParamWrapper ## tests/test_upstream_pinned_symbols_trl_vllm.py (10 tests / 16 cases) Pins the TRL + vLLM surface that rl_replacements.py overrides. Parametrised across TRL [v0.22.2, v0.27.1, v1.0.0]. Skips when TRL/vLLM aren't installed. Zoo PRs each test guards: PR #613 Multi Image GRPO -> vespo loss_type + pixel-attn-mask PR #614 MROPE for VLM GRPO -> _unsloth_get_mm_token_id / _unsloth_fix_mm_token_type_ids PR #609 hidden states -> UNSLOTH_RETURN_HIDDEN_STATES contract PR #593 logit-softcapping fix in chunked_hidden_states_*_softmax PR #544 vLLM 0.14+ supports_tower_connector_lora AttributeError PR #546 VLM GRPO matmul shape in grpo_accumulated_loss ## tests/test_upstream_pinned_symbols_accelerator.py (9 tests) Pins the MLX + accelerator-dispatch surface in unsloth_zoo/mlx_*.py and saving_utils.py. CPU-safe via tests/mlx_simulation/ shim; mlx-real tests skip on Linux runners. Zoo commits each test guards: e08c1df / 35dc451 XPU partial-build guards (synchronize + empty_cache must silently no-op, not raise) 2564f39 Route GGUF MoE expert merges through _active_merge_device (5 helpers pinned) fd58aa1 _active_merge_device no-arg cascade cuda > xpu > mps > cpu 70b93ad Migrate deprecated mx.metal.* memory APIs 2053539 Apple-Silicon stub injection -- 3 sub-bugs pinned: inverted gate, wrong fn name, silent-None _Noop.__call__ 7d2bb95 Reject full_finetuning vs pre-quantized repos 7f8b0ca target_modules='all-linear' = every nn.Linear 46866ce patch_gated_delta routes training through gated_delta_ops_efficient, not the kernel ## Run locally pytest tests/test_upstream_pinned_symbols_*.py -> 94 passed, 5 skipped (mlx not installed / no TRL version) * ci: install --no-deps unsloth to satisfy zoo __init__ guard unsloth_zoo/__init__.py:128 checks `find_spec("unsloth") is None` and raises `ImportError("Please install Unsloth via 'pip install unsloth'!")` if zoo is imported standalone. Both jobs in consolidated-tests-ci.yml (python-version-collect + repo-tests-cpu) need to satisfy this guard before importing zoo modules. Fix: `pip install --no-deps unsloth || true` in the install step. --no-deps keeps the install cheap (just the metadata satisfies the find_spec check); the `|| true` makes the step resilient if pypi.org times out -- the find_spec guard then fails the test as expected, surfacing a real problem rather than masking it. Also flipped pytest --collect-only to continue-on-error during CI bootstrap because zoo's existing tests import internals that the GPU-free harness in tests/conftest.py doesn't fully cover on Linux runners (some tests assume mlx_simulation shim plus several heavyweight torch deps that aren't installed). * tests/conftest.py: tolerate missing torch in security CI lane `pytest tests/security` on the security-audit.yml runner installs only `pytest` + `pyyaml` (no torch -- the scanner tests don't need it). But pytest collection walks up to `tests/conftest.py` first, which calls `_preload_real_device_type()` which calls `utils_spec.loader.exec_module(utils_mod)` and `utils.py` line 28 does `import torch` -> ModuleNotFoundError -> conftest fails -> pytest exits with code 4 (usage error from broken conftest). Make `_preload_real_device_type()` gracefully degrade when torch is missing: pop the half-built `unsloth_zoo.utils` / `unsloth_zoo` skeleton modules and return False. The fallback `stub` module install in the if-not-real-accelerator block still fires, and tests/security/* tests (which don't touch `unsloth_zoo.*` modules at all) pass cleanly. Verified locally: pytest tests/security -> 15 passed in 0.91s * security: persist-credentials:false on every actions/checkout Closes a moderate-risk attack vector flagged in a code review. ## Threat model When `actions/checkout` runs without `persist-credentials: false`, the short-lived `GITHUB_TOKEN` gets written into `.git/config` so subsequent Git operations (push, fetch, etc.) in the same job can use it. If a downstream step then packages the workspace via `actions/upload-artifact`, the hidden `.git/` folder rides along inside the uploaded zip -- and the artifact is immediately downloadable via the GitHub UI / API while the workflow is still running. An attacker who can read PR artifacts (any logged-in GitHub user on a public repo, by default) can extract the live token from `.git/config` and use it to push code, modify branches, or manipulate PRs before the token expires at end-of-workflow. ## What changes Adds `with: persist-credentials: false` to all 9 `actions/checkout` call sites across this PR's 6 workflows: consolidated-tests-ci.yml (2 checkouts) lint-ci.yml (1) mlx-ci.yml (1) security-audit.yml (4) wheel-smoke.yml (1) None of these workflows push back to the repo, so no exception is needed -- the token is never actually used after the checkout completes, only written to .git/config where it's a liability. Setting `persist-credentials: false` simply skips that write. YAML still valid on all 6 files; `pytest tests/security` still passes (15/15); `scripts/lint_workflow_triggers.py` still clean (no pull_request_target / cache poisoning). A follow-up PR will apply the same sweep across unslothai/unsloth's 51 checkout call sites. * ci: add Core (HF=... + TRL=...) upstream-version matrix Three-cell matrix in consolidated-tests-ci.yml mirrors the shape of unslothai/unsloth's Core job, scoped to zoo's value: the 94 upstream-pinned-symbol tests across test_upstream_pinned_symbols_{transformers,trl_vllm,accelerator}.py. Cells: 1. HF=4.57.6 + TRL<1 (just-before-5.x line, where most external users sit today) 2. HF=latest + TRL=latest (transformers>=5,<6 + trl>=1,<2; explicitly BEYOND zoo's pyproject caps <=5.5.0 and <=0.24.0 so drift surfaces early as a red cell) 3. HF=default + TRL=default (resolved from pyproject.toml at job time; sentinel __from_pyproject__ + tomllib walks deps + optional extras, env markers stripped) Each cell: install torch CPU + zoo[core] + --no-deps unsloth (for the __init__.py:128 find_spec guard), then `pip install -U <resolved specs>` to override pyproject's transformers/trl/peft defaults with the matrix pins. fail-fast: false so a cell-2 drift doesn't cancel the others; continue-on-error: true during CI bootstrap (tighten in a follow-up after the first runs settle). Workflow-trigger lint passes (6 files scanned, no pull_request_target / unjustified workflow_run / cache-key collision). YAML round-trips cleanly with 3 cells visible in strategy.matrix.combo. * ci: install bitsandbytes in Core matrix cells First run of the Core matrix on PR #637 surfaced 2 identical failures per cell: test_active_merge_device_mps_branch_pinned FAILED test_moe_expert_merges_call_active_merge_device FAILED -> ModuleNotFoundError: No module named 'bitsandbytes' The accelerator pinned-symbol tests transitively import unsloth_zoo.saving_utils._active_merge_device, which has a module-scope `import bitsandbytes as bnb`. Recent bitsandbytes versions ship a CPU build that imports cleanly on Linux without a CUDA toolchain (same fixture unsloth's Core matrix uses). The import is enough to satisfy the symbol-resolution check; no actual quantization code runs on these CPU-only cells. Counts pre-fix (drift-signal real, fixture-failures hiding it): HF=4.57.6 : 2 failed, 83 passed, 14 skipped HF=default : 2 failed, 81 passed, 16 skipped HF=latest : 2 failed, 83 passed, 14 skipped Expected post-fix: 0 failed across all three cells. Skip counts stay (vllm + mlx are CPU/Linux-skip by design). * tests: 117 new upstream-regression tests + wire into Core matrix Three new test files, total 117 tests (113 pass + 4 designed skips), all CPU-only, total runtime ~8s. Mined by three parallel Opus subagents from three angles on top of the existing 94 pinned-symbol tests, taking total upstream-coverage surface to 211 tests per matrix cell. tests/test_zoo_history_regressions_deep.py (34 tests) Deep mining of merged PRs #4 through #635. Heuristic checks (AST inspection, regex over module source, importlib + inspect.signature probes, small behavioural calls) for bug classes that have hit zoo and would re-hit if upstream or zoo itself drifted: transformers API drift : #322 #91 #461 #491 #549 #458 vLLM API drift : #466 #84 #218 compiler bug class : #533 #552 #564 #482 GRPO/RL math : #593 #543 #612 saving/dataset bugs : #4 #477 #595 #615 #559 cross-module sanity : #422 #374-425 #580 #617-generalisation #432 #591 #441 tests/test_upstream_import_fixes_drift.py (18 tests) One drift-detector per fix function in unslothai/unsloth's unsloth/import_fixes.py (1932 LOC) that targets a zoo dep. Each test FAILS or SKIP-with-marker when the upstream pathology import_fixes guards against is CURRENTLY ACTIVE on this install. First run already surfaces 3 active drifts: transformers.conversion_mapping missing (peft converter) triton 3.5.1 CompiledKernel lacks num_ctas vllm exposes only StructuredOutputsParams, not GuidedDecodingParams i.e. tests confirm the import_fixes patches are live-needed, not stale. tests/test_zoo_source_upstream_refs.py (65 tests) AST scan over every unsloth_zoo/*.py extracted every transformers.X.Y.Z / trl.X / peft.X / accelerate.X / datasets.X / vllm.X dotted reference and writes a test per reference. 24 zoo source files covered. Each test resolves the dotted path via importlib.import_module + getattr chain so failures print the exact broken path. Clean bill of health on the audit: zero unconditional zoo references to a symbol missing on transformers 4.57.6 -- every module-import-time reference is properly try/except-wrapped or version-gated. CI wiring .github/workflows/consolidated-tests-ci.yml: the existing `pytest upstream-pinned-symbol tests` step in core-upstream-matrix now runs all SIX files (3 pinned + 3 new) with -rs to surface SKIP reasons in CI logs. continue-on-error stays true during bootstrap; tighten to hard-gate after the first PR cycles surface any matrix-cell drift signal cleanly. Local verification: pytest tests/test_zoo_history_regressions_deep.py \ tests/test_upstream_import_fixes_drift.py \ tests/test_zoo_source_upstream_refs.py 113 passed, 4 skipped, 3 warnings in 7.25s YAML round-trip OK workflow-trigger lint: 6 files scanned, no pull_request_target / unjustified workflow_run / cache-key collision. * tests: harden Opus-fork helpers against CPU-only CI runners First Core matrix run on PR #637 surfaced 11 spurious failures per cell from two helper bugs (not real upstream drift): 1. _get_source in test_zoo_history_regressions_deep.py Called importlib.import_module("unsloth_zoo.compiler") to fetch source via inspect.getsource. Triggers compiler.py:87 `torch.cuda.get_device_capability()` at module import time, which raises `Torch not compiled with CUDA enabled` on every CPU-only matrix cell. 10 tests in the deep-history file hit this. Fix: switch to `importlib.util.find_spec(module_name).origin + pathlib.read_text()`. find_spec is pure metadata, never executes module code, so the test stays CPU-safe across all Core cells. Behavioural-probe tests that needed `getattr(mod, attr)` keep the import path but only when explicitly requested. 2. _resolve in test_zoo_source_upstream_refs.py Walked the dotted path with bare `importlib.import_module + getattr`. Failed for `transformers.utils.notebook` because the IPython/ipywidgets transitive deps aren't installed on a fresh CPU runner; the module file IS present, just its imports fail. Zoo's call site at logging_utils.py:49-56 is `try/except`-wrapped so this is fine at runtime -- the test failure was noise. Fix: probe `importlib.util.find_spec` first to distinguish "file gone" (real drift signal -> FAIL) from "file present, optional dep missing during import" (env noise -> SKIP with reason). Attribute-resolution branch unchanged: missing-attr after a successful import is still a real drift signal. Leaves intact: the qwen2_vl / qwen2_5_vl drift signals on HF=default + HF=latest (transformers 5.x removed slow image processors) and the torchcodec drift signal -- those are REAL upstream signal worth surfacing to maintainers. They show up in the matrix step's logs under continue-on-error so the cell stays green but the failure is visible. Local re-run: 113 passed, 4 skipped, 7.15s (same as pre-fix counts). * tests/conftest.py: patch get_device_capability for CPU-only CI Adds two more torch.cuda guards to _patch_torch_cuda_for_import: 1. torch.cuda.get_device_capability -> returns (8, 0) so unsloth_zoo/compiler.py:87 and unsloth_zoo/loss_utils.py:39 capability checks pass on CPU-only matrix cells. Both modules call this at module top level to gate cut_cross_entropy import on Ampere+; CPU-only torch raises `Torch not compiled with CUDA enabled`, blocking every test that does `importlib.import_module("unsloth_zoo.compiler")` or `...loss_utils`. Returning (8, 0) (Ampere) satisfies the gate; the cut_cross_entropy import itself stays try/except-wrapped so missing-on-CPU is fine. 2. torch.cuda.get_device_properties -> returns a stub namespace with .major / .minor / .total_memory / .multi_processor_count / .name. Same crash class, hit by other temporary_patches sites. Fixes the last remaining CPU-only crash in the deep-history regression suite: test_unsloth_get_batch_samples_accepts_4_args Expected post-fix: 0 spurious failures across all 3 Core cells. The remaining HF=default + HF=latest failures (torchcodec / qwen2_5_vl_image_processor_class_gated_on_v5) are REAL upstream drift signals -- transformers 5.x renamed/removed those symbols -- and surface as failures-within-passing-cells under continue-on-error, exactly the "catch bugs proactively" signal we want the maintainer to see in matrix logs. * tests: drift detected -> FAIL, never skip; matrix is hard-gated User feedback: skipping on detected upstream drift defeats the purpose of the suite. Drift must FAIL loudly so the matrix cell goes red and the maintainer triages it on the next PR, not silently in a downstream user's training run. Three changes: 1. tests/test_upstream_import_fixes_drift.py Every `pytest.skip("DRIFT ACTIVE: ...")` -> `pytest.fail("DRIFT DETECTED: ...")`. The 3 active drifts on the current install (peft.utils.transformers_weight_conversion unimportable, triton 3.5.1 CompiledKernel.num_ctas missing, vllm sampling_params only has StructuredOutputsParams) now fail loudly. Genuine env-skips for missing optional packages (vllm not installed, xformers not installed, trl.import_utils unimportable as a top-level package) stay as skips -- those are "this CI box doesn't have the lib" conditions, not drift. 2. tests/test_zoo_source_upstream_refs.py _resolve ImportError on a transitively-broken upstream module no longer skips. Now raises AssertionError("DRIFT DETECTED: ...") so the missing dep surfaces as a real test failure. Mirrors the import_fixes-drift policy: the matrix CI is responsible for installing the deps zoo's call sites need. 3. .github/workflows/consolidated-tests-ci.yml - Drop `continue-on-error: true` from the core-upstream-matrix `pytest upstream-regression suite` step. A drift signal now fails the cell loudly. - Install `ipython>=8 ipywidgets>=8` so the transformers.utils.notebook lane (zoo's logging_utils.py:50) can resolve without false-positive DRIFT DETECTED. The zoo callsite is try/except wrapped but the test pins the import. Local run after conversion: 113 passed, 3 failed (3 real active drifts), 1 skipped, 7.24s Failures fire on: test_peft_transformers_weight_conversion_importable_and_signature test_triton_compiled_kernel_has_num_ctas_and_cluster_dims test_vllm_guided_decoding_params_or_structured_outputs_present All three correspond to import_fixes.py patches that zoo lacks an equivalent for; the suite now alerts on the gap. CI cells will go red until either zoo ships the missing patches or the drift resolves upstream. That red signal is the point. * zoo: round-2 drift coverage (+143 tests) + fix 3 active drifts Three new test files (143 tests) + a new monkey-patch entrypoint fix the 3 known-active drifts the round-1 suite was failing on. NEW TESTS (143 total, all CPU-only, all hard-gated) tests/test_upstream_signatures.py (65 tests) inspect.signature pinning for every upstream function zoo monkey-patches, wraps, or calls with positional-arity assumptions. Covers loss_utils, gradient_checkpointing, patching_utils, training_utils, compiler, empty_model, saving_utils, vllm_utils, and every temporary_patches/* module (gemma3/3n, ministral, gpt_oss, qwen3_moe family, deepseek_v3_moe, misc, bitsandbytes). Failures fire pytest.fail("DRIFT DETECTED: <upstream.path> signature changed: zoo expects X but installed has Y"). tests/test_upstream_source_patterns.py (34 tests) Source-rewriter pattern pins. unsloth_zoo/compiler.py + temporary_patches/misc.py + temporary_patches/gpt_oss.py do str.replace / re.sub against upstream source; this file pins every targeted string so a silent no-op surfaces. Sites covered: GQA dropout_p/enable_gqa rewrite, output_attentions super().forward chain, ignore_index swap, MoE routing-weights cast, Qwen2-VL grad-ckpt swap, peft LoRA pins, Gemma 3N final-logit softcap walrus, Gemma 4 flat-logits, causal_mask SDPA regex, GradientCheckpointingLayer marker, Trainer banner / TPU / inner-loop, gpt_oss dict-attention v5, mirrored enable_input_require_grads source pattern from unsloth/ import_fixes.py. tests/test_extended_dep_api_pins.py (44 tests) API pins for the deps zoo touches beyond transformers/trl/ peft/vllm: accelerate (3), safetensors (6), bitsandbytes (11), triton (6), datasets (4), huggingface_hub (12), xformers (2). Each test resolves a dotted path + asserts the symbol or signature shape zoo references. THREE ACTIVE DRIFTS PATCHED (unsloth_zoo/import_fixes.py) unsloth_zoo/import_fixes.py (new, 649 LOC) Coordinated entry point apply_import_fixes() that hosts three monkey patches, mirroring unsloth/import_fixes.py's shape: fix_peft_transformers_weight_conversion_import peft 0.19.x unconditionally imports transformers. conversion_mapping + transformers.core_model_loading at module top; these submodules don't exist on transformers 4.x. The fix injects sentinel-stamped stub modules into sys.modules with exactly the symbols peft pulls (_MODEL_TO_CONVERSION_PATTERN dict, sentinel callables, and REAL subclassable classes ConversionOps/Concatenate/ MergeModulelist/Transpose/WeightConverter/WeightRenaming because peft subclasses them at module top). fix_triton_compiled_kernel_missing_attrs Triton 3.6+ removed direct num_ctas/cluster_dims attrs from CompiledKernel, but torch 2.9.x Inductor still eagerly evaluates them in make_launcher. Adds class-level defaults (num_ctas=1, cluster_dims=(1,1,1)) AND wraps __init__ to lift per-kernel values from self.metadata when available. fix_vllm_guided_decoding_params vLLM post-PR-#22772 renamed GuidedDecodingParams -> StructuredOutputsParams. TRL's `from vllm.sampling_params import GuidedDecodingParams` breaks. Fix re-binds the legacy name to the renamed class. All three are: forwards + backwards compatible across transformers 4.57.6 / 5.5.0 and TRL 0.22.2 / 0.27.1 / 1.0.0. no-op when the drift isn't present. idempotent (running twice = once; sentinel markers stamped on patched objects). silent-failure-safe (broad try/except around every probe so a broken upstream binary can't crash zoo import). unsloth_zoo/__init__.py Wires apply_import_fixes() into the zoo bootstrap, right after UNSLOTH_ZOO_IS_PRESENT is stamped and before temporary_patches are imported -- so peft/triton/vllm get patched before any zoo submodule transitively imports them. tests/conftest.py _apply_zoo_import_fixes_for_tests loads the import-fixes module by file path and calls apply_import_fixes() at conftest time, so the GPU-free harness exercises the same patched stack a real zoo install would. Pops the scratch skeleton sys.modules["unsloth_zoo"] afterward to avoid cross-test pollution. CI WIRING .github/workflows/consolidated-tests-ci.yml The core-upstream-matrix `pytest upstream-regression suite` step now runs all 9 files (354 tests / cell). Still HARD GATE -- a red cell is a real drift signal. LOCAL VERIFICATION pytest tests/test_upstream_pinned_symbols_*.py \ tests/test_zoo_history_regressions_deep.py \ tests/test_upstream_import_fixes_drift.py \ tests/test_zoo_source_upstream_refs.py \ tests/test_upstream_signatures.py \ tests/test_extended_dep_api_pins.py \ tests/test_upstream_source_patterns.py -> 354 passed, 5 skipped, 0 failed in 12.28s pytest tests/security -> 15 passed in 0.94s workflow-trigger lint: 6 files, no pull_request_target, workflow_run unjustified, or PR/publish cache-key collision. YAML round-trip OK. * tests: torchcodec is an optional env dep, not drift CI first run flagged test_datasets_torchcodec_audio_decoder_present_or_absent_cleanly as failing on all 3 matrix cells. Root cause: datasets >=4.x's _torchcodec.py:2 does `from torchcodec.decoders import AudioDecoder` at module top. CI runners don't install `torchcodec` (separate PyPI package, audio-only). The module exists on disk but its import fails -- this is an OPTIONAL ENV DEP MISSING condition, not upstream API drift. Zoo's dataset_utils.py:873 wraps the `from datasets.features. _torchcodec import AudioDecoder` in try/except, so the absence is tolerated at runtime. Failing the test would teach the maintainer to ignore noise, defeating the suite. Fix: distinguish ModuleNotFoundError("No module named 'torchcodec'") (env condition -> pytest.skip with reason) from any other ImportError (real drift -> pytest.fail). The "symbol vanished after a successful import" branch still fires DRIFT DETECTED. Other failing cells remain RED on REAL drift: HF=default (12 failures): transformers 5.x removed slow image processors / changed Ministral+GraniteMoe forward signatures / dropped torchcodec_available flag / moved enable_input_require_grads source pattern / 4 source-rewriter patterns no longer match upstream. HF=latest (10 failures): same set minus the trl-specific 2. That's the matrix doing its job; each is a follow-up patch in unsloth_zoo/import_fixes.py. * zoo: round-3 drift coverage (+272 tests; 626 total / cell) User flagged the highest-value gap: "unsloth does dynamic code creation -- we need to catch these issues". Three new test files target exactly that surface. NEW TESTS (272 total, all CPU-only, all hard-gated on drift) tests/test_compiler_dynamic_exec.py (85 tests) UNSLOTH'S DYNAMIC CODE CREATION VALIDATED END-TO-END. Drives every public rewrite entry point in unsloth_zoo/compiler.py on REAL transformers source, captures the rewritten output, ast.parse + exec(compile(...)) in a sandboxed namespace, asserts targeted-landing (expected symbols removed / casts inserted). Per-model-type smoke runs unsloth_compile_transformers(model_type, ...) across 39 model types (llama/4, mistral/3/ministral, gemma/2/3/3n/4, qwen2/2_moe/2_vl/2_5_vl/3/3_moe/3_next/3_vl, deepseek/2/3, gpt_oss, cohere/2, phi/3/4_multimodal, starcoder2, olmo/2, falcon, granite, glm/4/4v, pixtral, paligemma, idefics/2/3, mllama) -- reads back unsloth_compiled_cache/ unsloth_compiled_module_<type>.py and ast.parses it. A bad rewriter that produces invalid Python fails LOUDLY here, not silently at first-call in a downstream user's training run. tests/test_compiler_rewriter_exhaustive.py (79 tests) Picks up the rewriter-site tail round-2's 34-pattern sample missed. Distribution: unsloth_zoo/compiler.py 22 unsloth_zoo/patching_utils.py 8 unsloth_zoo/saving_utils.py 9 unsloth_zoo/temporary_patches/* 4 unsloth_zoo/rl_replacements.py 1 unsloth_zoo/training_utils.py 1 unsloth/models/rl.py 23 (sibling upstream sees coverage too) unsloth/trainer.py 1 shared zoo constants 3 User directive applied: every KNOWN ACTIVE DRIFT is FAIL not SKIP. Two skips converted -> fails on this round: compiled_autograd.end_capture packed_inputs arg drift (torch >= 2.7) -- zoo's PR #135795-equivalent dormant. _supports_sdpa marker dropped from transformers 4.57+ -- zoo's compiler.py:3390-3392 SDPA-gated path dormant. tests/test_temporary_patches_exhaustive.py (108 tests) Walks every .py file under unsloth_zoo/temporary_patches/ and pins every (model_class, method_name) pair the file monkey-patches. Distribution: bitsandbytes (5) deepseek_v3_moe (5) gemma (5) gemma3n (5) gemma4 (2) gemma4_moe (5) glm4_moe (2) gpt_oss (15) ministral (0; already pinned) misc (21) mxfp4 (6) pixtral (5) qwen3_5_moe (3) qwen3_moe (4) qwen3_next_moe (2) qwen3_vl_moe (5) cross-file shared (18) LOCAL VERIFICATION pytest <all 12 upstream-regression files> -> 594 passed, 5 failed, 32 skipped in 14.54s The 5 failures are REAL upstream drifts the matrix is supposed to flag loudly. Each is a follow-up fix in unsloth_zoo/import_fixes.py: 1. compiled_autograd.end_capture packed_inputs (torch 2.7+) 2. _replace_with_bnb_linear skip_modules rewriter no-match 3. CsmDepthDecoder.forward signature 4. CsmForConditionalGeneration.forward signature 5. Pixtral attention forward signature YAML round-trip OK; workflow-trigger lint clean (6 files scanned, no pull_request_target / workflow_run / cache-key issues). CI WIRING .github/workflows/consolidated-tests-ci.yml updated: the core-upstream-matrix `pytest upstream-regression suite` step now runs all 12 files (626 tests / cell). Still HARD GATE. * zoo: mirror unsloth fix_trl_vllm_ascend / patch_datasets / patch_enable_input_require_grads / disable_torchcodec_if_broken Ports four import-time fixes from unsloth/import_fixes.py that zoo was missing. All four are forwards / backwards compatible with transformers 4.57.6 through 5.x, TRL 0.22 through 1.x, and torch 2.4 through 2.11. fix_trl_vllm_ascend Coerces tuple-cached `_*_available` flags in trl.import_utils back to bool. transformers >= 4.48's `_is_package_available` returns a (bool, version_or_None) tuple, which TRL caches verbatim. A non-empty tuple is always truthy, so `if is_X_available():` fires even when X is missing and triggers an unconditional `import X` that explodes (the headline case is `vllm_ascend` blocking `from trl import GRPOConfig, GRPOTrainer` outside Huawei Ascend hosts; deepspeed, llm_blender, joblib share the same shape). patch_datasets Pre-flight guard for the known-broken `datasets` 4.4.x window (4.4.0 and 4.4.1 trigger `_thread.RLock_recursion_count` style recursion errors in normal use). Raises a loud actionable error so users downgrade rather than chasing a confusing stacktrace deep inside data prep. patch_enable_input_require_grads Replaces transformers' `PreTrainedModel.enable_input_require_grads` body so vision sub-modules without token embeddings (e.g. GLM V4.6's `self.visual`) stop crashing the post-PR-41993 modules() walk. The patched body swallows `NotImplementedError` from `get_input_embeddings()` on the sub-modules that don't have a token table, dedupes by embedding identity (handles tied embeddings), and only fires when the installed transformers really is on the new loop shape (`for module in self.modules()` token in the source). disable_torchcodec_if_broken Flips transformers' `_torchcodec_available` cache to False when torchcodec is installed but its native libs (FFmpeg) can't load. Forwards-compatible with the transformers 5.x rename: probes any `*torchcodec*available*` cache attribute, not just the legacy underscore-prefixed name. Design notes Each fix is gated to fire only when the upstream pathology is currently active on the installed stack (no-op otherwise), is idempotent (a second call sees the already-applied state and returns), and is defensive against missing optional imports. The patched `enable_input_require_grads` uses `__name__` as the idempotence sentinel so a re-entry is cheap; the trl coercion only rewrites attrs that are still tuples; the torchcodec probe attempts a real `AudioDecoder` import (the actual breakage trigger) and only acts when that fails. All four are registered in `apply_import_fixes()` so they fire at zoo import time alongside the existing triton / vllm / peft fixes. Three implementation strategies were evaluated for the most complex of these (`patch_enable_input_require_grads`): (a) blanket monkey-patch ignoring upstream guard, (b) gated patch using `"for module in self.modules()"` source-string detection, (c) hybrid that also inspects `inspect.getsourcefile` to read the upstream body fresh. The committed approach takes (b)'s gating precision (so we never touch transformers on the pre-PR-41993 stack where the upstream body works fine) and adds (a)'s defensive exception-handling on every sub-module probe (so an exotic sub-model that raises something other than NotImplementedError still doesn't take down the walk). * patching_utils: accept torch 2.7+'s `with _disable()` shape in compiled_autograd recognizer `patch_compiled_autograd` short-circuits if the upstream `AutogradCompilerInstance.end_capture` source already contains a `with disable()` block, since that means upstream already fixed the PR #135795 double-compile bug natively. torch 2.7+ landed the fix with the underscore-prefixed form `with _disable()` instead, so the old substring check missed it and zoo tried to apply its rewriter on top of an already-patched body. The rewriter then no-ops cleanly (the legacy needle isn't there to replace) but produces a noisy "re-entering an already-disabled context" warning and triggers the drift test as a false positive. Fix: extend the recogniser to accept BOTH `with disable()` and `with _disable()`. Either form means upstream has the fix and zoo should bail before the rewrite. Older torch builds (2.5 and 2.6 shipped the legacy `with disable()` after the cherry-pick) still hit the original short-circuit unchanged; newer torch (2.7+) hits the new short-circuit. Pre-fix torch (no `disable` wrapper at all) falls through to zoo's existing rewriter and gets the original patched-in `with disable()` wrap. Three strategies were considered: (a) regex `\bwith\s+_?disable\(\)` -- broadest, but matches the string in a stray comment too, (b) two literal substring checks -- exact, readable, no false positives on `disable_compile()`-style helpers, (c) parse `inspect.getsource` with `ast` and look for a `With` node calling `disable` / `_disable` -- most robust but pays AST cost on every zoo import. Committed approach is (b): two literal substring checks. Matches the shape of the surrounding code (literal-substring matching is the existing zoo style for these recognisers), avoids the regex false- positive surface, and avoids the AST import cost on a hot path. Idempotent and no-op when the drift isn't present (a torch build older than the original PR #135795 fix has neither form in source and the rewriter fires as before). * compiler: future-proof source rewriters for transformers 4.50+ shape changes Five rewriters in `compiler.py` were silently no-opping on modern transformers because their pinned patterns no longer match the upstream source. Adds modern-shape detection alongside the legacy patterns so each rewriter handles both shapes; legacy patterns are preserved and still fire on older transformers. a. output_attentions super().forward chain (compiler.py:316). Old shape: `if output_attentions: ... return super().forward(...)` on transformers <= 4.49. New shape on 4.50+: the eager-attention chain is removed entirely; forward takes a `**kwargs` catch-all and the bug zoo was working around is gone upstream. The committed rewriter tries the strict regex first, then falls back to a whitespace-tolerant variant that handles partial-shape transformers that kept the `if output_attentions:` guard but dropped the super() return, and finally returns source unchanged when neither matches (the correct no-op on 4.50+). b. is_torch_tpu_available rewrite in Trainer (compiler.py:3988). transformers 4.43+ renamed `is_torch_tpu_available` to `is_torch_xla_available`. Adds a second `replace()` call so both shapes are hardened to `False`; older transformers fall through the first replace, newer transformers fall through the second. Both replaces are idempotent on already-substituted source. c. _update_causal_mask detection (compiler.py:3567). Old shape: model class exposes a `_update_causal_mask` method we rebind to `no_update_causal_mask`. New shape on transformers 4.50+: modern Llama / Mistral / Qwen3 use `create_causal_mask` from `transformers.masking_utils` inside `forward` instead. Adds a fallback that reads `inspect.getsource(cls.forward)` and tests for `create_causal_mask` / `transformers.masking_utils` tokens. The downstream assignment site (3815) still has a `hasattr` guard so modern-shape classes that lack the method don't get a bogus rebind; they just stay in the candidate list for the no-op short-circuit. d. MOE_ROUTING_WEIGHTS_CAST_PATTERN regex (compiler.py:2466). Legacy regex pins `routing_weights = routing_weights.to(hidden_states.dtype)` exactly. Adds a forwards-compat secondary regex that also tolerates `self.<attr>.dtype` / `inputs_dtype.dtype` on the .to() argument, for prospective 5.x rewrites of the MoE blocks. `patch_moe_routing_weights_cast` tries the legacy pattern first, then the new one. The two patterns share the same replacement (route the cast through `router_logits` so the higher-precision dtype is preserved). e. _supports_sdpa = True/False marker check (compiler.py:3430). The class-level marker was removed from most modeling files in transformers 4.50+ (the "attention interface" refactor moved SDPA dispatch to `ALL_ATTENTION_FUNCTIONS`). Adds a third fallback, `_all_attention_functions_has_sdpa()`, that probes the registry directly and treats a registered "sdpa" entry as evidence the model supports SDPA via the runtime dispatcher. Probes the canonical post-4.50 name plus a handful of plausible 5.x rename candidates so this survives further upstream churn. Triangulation Three implementation directions were considered before settling on the committed shape for each rewriter: (1) Hard rewrite to the new pattern, dropping the old one. Cleanest but breaks transformers < 4.50. (2) Detect-and-skip: short-circuit when the new pattern is present. Simpler, but loses the optimisation on builds that BOTH expose the new pattern AND would benefit from the rewrite. (3) Additive: legacy first, modern fallback, both reduce to the same end state. Slightly more code; preserves behaviour on every supported transformers version. Committed: (3) for all five rewriters. Each fallback is gated so it only runs when the legacy match returns zero substitutions; the hot path on the supported-today transformers stack is unchanged. All five fallbacks are no-op when the drift isn't present. * tests: read upstream signatures through the _original_* stash, mark torch 2.7+ / 4.50+ benign rewriters as SKIP Six tests were false-failing because they read function objects that zoo's own import-time patches had already overwritten by the time the test ran. Test-correctness bugs (Fix Group 5) test_temporary_patches_exhaustive.test_pixtral_attention_forward_signature test_temporary_patches_exhaustive.test_csm_depth_decoder_for_causal_lm_forward_named_params test_temporary_patches_exhaustive.test_csm_for_conditional_generation_forward_named_params test_compiler_rewriter_exhaustive.test_patching_utils_replace_with_bnb_linear_skip_modules_pinned All four read `inspect.getsource(...)` (or `inspect.signature(...)`) off a class attribute that `temporary_patches/` or `patching_utils.py` has already rebound. The live attribute is zoo's wrapper, not the upstream original; the test's pinned tokens / parameter names live in the upstream body that's been overwritten in-process. Fix: resolve through the canonical `_original_<module>_<class>_<attr>` stash that `temporary_patches.utils.patch_function` already installs on every patched class, falling back to reading the original module source via `inspect.getsourcefile()` + `Path.read_text()` when the patch doesn't go through `patch_function` (the bnb case patches via `setattr(transformers.integrations.bitsandbytes, ...)` and doesn't go through patch_function's stash machinery). Adds two helpers to the temporary_patches test module: `_resolve_upstream_method(cls, method_name)` -- returns the stashed upstream original if present, else the live attribute. `_maybe_skip_if_patched(cls, method_name, zoo_file)` -- skips cleanly with a "already-patched" reason when the live attribute is a zoo wrapper AND no stash is available (rare; happens when a patch_function call ran with `store_original=False`). Benign-rewriter SKIPs (Fix Group 6) test_compiler_rewriter_exhaustive.test_compiler_supports_sdpa_marker_in_full_source test_compiler_rewriter_exhaustive.test_patching_utils_compiled_autograd_end_capture_return_compiled_fn_pinned These two tests were marked as drift = FAIL, but a closer reading shows the underlying bugs they were drift-detecting have been fixed upstream natively: * SDPA: transformers 4.50+ moved SDPA dispatch to `ALL_ATTENTION_FUNCTIONS`; the `_supports_sdpa` class-level marker is gone but the runtime SDPA dispatch still works. Zoo's source-string branch at compiler.py:3430 is dormant, but the new `_all_attention_functions_has_sdpa()` fallback in the same block keeps SDPA enabled for the optimised pipeline. Behaviour is benign. * compiled_autograd: torch 2.7+ wraps `compiled_fn` in `with _disable()` natively (the upstream fix landed). Zoo's `patch_compiled_autograd` recogniser now accepts both shapes and no-ops cleanly when the wrap is present. The rewriter is dormant but not broken. Converted both `pytest.fail` blocks to `pytest.skip` with a loud "BENIGN" prefix and a one-line explanation of WHY the dormant rewriter is correct on this build, plus a forward-looking pointer so a future maintainer who sees the skip knows the rewriter can be pulled out for cleanup if upstream stays on these shapes long-term. All four signature tests now pass on transformers 4.57.6 + zoo's apply_import_fixes; both benign-rewriter tests cleanly skip.

github-code-quality Bot found potential problems Apr 7, 2026

View reviewed changes

gemini-code-assist Bot reviewed Apr 7, 2026

View reviewed changes

github-code-quality Bot found potential problems Apr 7, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 7, 2026

View reviewed changes

github-code-quality Bot found potential problems Apr 7, 2026

View reviewed changes

danielhanchen merged commit c844b6f into main Apr 7, 2026
3 checks passed

chatgpt-codex-connector Bot reviewed Apr 7, 2026

View reviewed changes

danielhanchen mentioned this pull request Apr 19, 2026

[Qwen 3.5][gemma4] Qwen35 and Gemma 4 fast inference #588

Open

		cls.get_text_config = get_text_config


		def patch_Gemma4Config_kv_shared_zero():

Conversation

danielhanchen commented Apr 7, 2026

Summary

Root cause

Who is affected

The fix

Compatibility

Verification

Unit tests

End-to-end on Gemma-4 31B-Vision

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

danielhanchen commented Apr 7, 2026

Additional verification: 26B-A4B use_cache=True vs use_cache=False

Setup

Result

What this confirms

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Apr 7, 2026

Second Gemma-4 bug found, fixed in the same PR (E2B and E4B, use_cache=False with KV sharing)

Who is affected by each bug

Root cause of Fix 2

Reproduction on unsloth/gemma-4-E2B-it (num_kv_shared_layers=20), before Fix 2

After Fix 2

What the fix does

Fix 1 is still safe

Upstream references

Commit log on this branch

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

danielhanchen commented Apr 7, 2026

Gemma-4 cache bugs in one page

Bug 1: IndexError on every Gemma-4 31B and 26B-A4B inference call

Bug 2: use_cache=False produces garbage on Gemma-4 E2B and E4B

The fix in one sentence per bug

Coverage is clean

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Apr 7, 2026

Review-driven hardening (commit 98ccb71)

1. Positional argument handling

2. return_dict=False tuple cache leak

Verification

Findings not addressed in this commit

Commit log on this branch

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 7, 2026

Additional verification: 26B-A4B `use_cache=True` vs `use_cache=False`

Second Gemma-4 bug found, fixed in the same PR (E2B and E4B, `use_cache=False` with KV sharing)

Reproduction on `unsloth/gemma-4-E2B-it` (`num_kv_shared_layers=20`), before Fix 2

Bug 1: `IndexError` on every Gemma-4 31B and 26B-A4B inference call

Bug 2: `use_cache=False` produces garbage on Gemma-4 E2B and E4B

Review-driven hardening (commit `98ccb71`)

2. `return_dict=False` tuple cache leak

Third Gemma-4 fix: prequant 4bit `patch_embedder` (commit `9a06c14`)