[staging] Fix Studio q2_k_l GGUF export + new llama.cpp converter package layout by danielhanchen · Pull Request #18 · danielhanchen/unsloth-zoo-staging

danielhanchen · 2026-05-18T10:49:32Z

Throwaway staging PR validating the q2_k_l preset dispatch + new llama.cpp conversion/ package layout patcher on Linux, macOS, and Windows runners. Do not merge — close after CI confirms green; the real fix will go to unslothai/unsloth-zoo.

Summary

Bug 1: MLX/Studio export crashes with main: invalid ftype 'q2_k_l' on Apple Silicon. q2_k_l is an Unsloth-side preset (q2_k + --output-tensor-type q8_0 + --token-embedding-type q8_0). The CUDA path already knew this; the MLX path did not. Centralised the expansion inside quantize_gguf so every caller shares one source of truth.
Bug 2: Upstream llama.cpp split convert_hf_to_gguf.py into a thin entrypoint + conversion/ package. Three Unsloth regex patches still targeted the old monolith, producing three warnings and silent breakage: empty architecture allowlist, missing Unsloth metadata branding on produced GGUFs, and a confusing-but-obsolete Qwen2MoE warning. Patcher is now layout-aware: AST-extracts archs from conversion/__init__.py, patches conversion/base.py in place (idempotent marker), skips the obsolete Qwen patch.

Cross-OS safety

q2_k_l branch is preset-only — every other ftype (incl. real llama.cpp q3_k_l) takes the original code path byte-for-byte unchanged.
Patcher detects layout structurally — old monolithic checkouts behave identically.
Linux user's previously-successful Llama-3.1 → Q3_K_M export continues to work; it just regains the Unsloth branding metadata.

Test plan

24 new tests, all green locally:

Linux: pytest tests/test_quantize_gguf_q2_k_l.py tests/test_convert_hf_to_gguf_patcher.py -v
macOS-14: same
Windows: same

CI: .github/workflows/studio-export-fix-ci.yml — 3-OS matrix, paths: filtered, concurrency.cancel-in-progress: true, max-parallel: 3 to stay under the 5-Windows-runner cap.

- Add "mlx" device detection in device_type.py (Darwin + arm64 + mlx.core) - Add triton and bitsandbytes stubs for MLX (no CUDA deps needed) - Inject stubs in __init__.py when running on Apple Silicon - Add MLXTrainer + MLXTrainingConfig mirroring SFTTrainer API - Add CCE loss via mx.fast.cce_loss (memory-efficient, no full logit materialization) - Add baseline CE fallback via nn.losses.cross_entropy - Add FastLanguageModel loader using mlx-lm (from_pretrained + get_peft_model) - Add mlx-cce and mlx-lm as optional deps in pyproject.toml - Guard torch.cuda.memory calls in gpt_oss.py for non-CUDA devices Benchmarked on M4 Max across 6 models (Llama-1B/3B, Gemma-2-2B, Qwen-3B/7B): - All 6 models PASS correctness (max loss diff < 0.04) - Memory savings: 5-22% - Speed: neutral to 13% faster

- MLXTrainer: gradient_checkpointing, mx.compile graph fusion, per-step timing, grad clipping (max_grad_norm=1.0 matching HF default) - mlx_utils: apply/remove gradient checkpointing, quantized CCE loss fn with fused matmul kernels, dynamic-length batching via iterate_batches, proper padding mask using lengths - Benchmark: subprocess isolation per (model, config), alternating config order to avoid systematic bias, streaming output, warmup exclusion - Add standalone quantized CCE benchmark script

…ability - Rewrite training loop to mlx-lm compiled step pattern (single compiled fn for fwd+bwd+accumulate+update) - Add _ensure_lora_frozen() to prevent NaN from unfrozen LayerNorm in adaptive optimizers - Add mx.stop_gradient(w) to skip weight grad for frozen LM head in LoRA - Fix quantized biases fallback (was layer.scales, now None) - Loss functions return (loss, ntoks) tuples for proper grad accumulation - Add LoRA+, eval loop, streaming batches, Muon/Lion optimizers, LR scheduling - Default compile=True, gradient_checkpointing=True, max_grad_norm=0 - Benchmark: log every step, constant LR, add seed for reproducibility

- Add unsloth_zoo/mlx_cce/ subpackage with chunked cross-entropy engine (pure Python + inline Metal shaders, no C++ build needed) - Remove external mlx-cce-runtime dependency from pyproject.toml - Add mlx and mlx-lm as platform-conditional main dependencies (macOS arm64 only) so pip install unsloth gets everything - Simplify mlx_utils.py: direct import from bundled mlx_cce, remove lazy import machinery and has_cce_kernel() - Simplify mlx_trainer.py: remove CCE availability check and fallback warning (CCE is always bundled)

Clean up dead code and experimental code in mlx cce

- Add mlx_cce/ subpackage: chunked cross-entropy loss engine (pure Python + Metal shaders) - Add mlx_trainer.py: MLXTrainer with compile, gradient checkpointing, LoRA+ support - Add mlx_loader.py: FastMLXModel for model loading via mlx-lm - Add mlx_utils.py: loss functions, batching, model save/export (safetensors, GGUF, Hub) - Add stubs/ for triton and bitsandbytes on Apple Silicon - Update __init__.py: detect MLX and skip GPU init on macOS arm64 - Update pyproject.toml: conditional deps (skip torch/trl/peft on Mac, add mlx/mlx-lm) - Update llama_cpp.py: make torch import optional, fix CUDA-only assumptions

VLM support for MLX training, fp32 loss accumulation, train_on_responses, fully-masked prompt check. Conflicts resolved: - pyproject.toml, .gitignore, __init__.py, device_type.py, llama_cpp.py, temporary_patches/misc.py: kept HEAD (already updated for unslothai main) - mlx_loader.py, mlx_trainer.py, mlx_utils.py: took feat/mlx-vlm versions (new VLM features) - mlx_cce/__init__.py: took feat/mlx-vlm version (restores install_mlx_fast_cce_loss) - mlx_cce/runtime_cce.py: took feat/mlx-vlm version (fp32 grad_weight accumulation)

…**kwargs) (unslothai#651) Generalises the Pixtral-specific tolerance from unslothai#650 to all drift detectors that use ``_assert_params_superset``. Newer transformers wrap an expanding set of model forwards as ``(self, *args, **kwargs)`` via attention-implementation dispatch and generic decorators. Recent casualties on transformers 4.57.6 include: - tests/test_temporary_patches_exhaustive.py ::test_csm_depth_decoder_for_causal_lm_forward_named_params ::test_csm_for_conditional_generation_forward_named_params The zoo patches at those sites pass through kwargs verbatim, so the call sites stay valid; the static param-name pin just is not meaningful against a passthrough wrapper. Skip with a descriptive message instead of emitting spurious DRIFT. The original FAIL branch is preserved for the case where required names are missing AND the function does not take **kwargs (a real upstream removal). Strict behaviour on transformers 4.x with introspectable signatures is unchanged.

The previous default of 5.0 essentially never fired on real transformer training: per-element gradients in steady state are typically 1e-3..1e-1, so |g_i| > 5.0 is extremely rare even on spike batches, mixed-precision overflow, or RL gradient bursts. The threshold was nominally "elementwise clipping" but in practice a no-op, leaving the trainer with no protection in the regimes where clipping actually matters. 1.0 still keeps the trainer on MLX's fast per-leaf ``tree_map(mx.clip, ...)`` path (no global reduction, no cross-leaf synchronisation) while actually catching outliers, and aligns with the universal LLM ``clip_grad_norm=1.0`` baseline used by HuggingFace Trainer / TRL / PEFT / AutoTrain. Empirical sanity-check on a CUDA mirror of ``tests/studio/run_real_mlx_smoke.py`` (same model / LoRA / data / optimizer; gradient clipping reproduced by ``clip_grad_value_``): norm_grad_norm=1.0 (control) losses 7.64 -> 0.006 PASS max_grad_value=5.0 (old) losses 7.29 -> 8.39 DIVERGED max_grad_value=1.0 (new) losses 7.29 -> 3.4 (range across 3 seeds 3.21..3.65; consistent plateau, no divergence) max_grad_value=0.5 / 0.25 / 0.1 noisier still 5.0 lets the optimizer overshoot the local basin on the 7-step smoke; 1.0 produces a stable (if conservative) update regime across seeds. On real long-horizon training, the difference is almost invisible -- 1.0 fires only when needed. Updates the dataclass default, the inline ``getattr`` fallback and its sentinel branch in ``MLXTrainer.fit``. Behaviour preserved when caller explicitly passes ``max_grad_value``; only the unset path changes. The "max_grad_value wins when both set" policy is kept for now to avoid silently slowing down callers who upgraded to this build expecting the fast per-element path; a separate change can revisit that.

…nslothai#647) * saving: layout-aware MoE LoRA merge + loud-fail on fallback (#5410) `save_pretrained_merged(save_method="merged_16bit")` silently dropped the entire MoE expert LoRA delta on Qwen3-MoE / Qwen3.5-MoE-style models with peft >= 0.19.1. The per-expert helpers in `saving_utils.py` hardcoded the PEFT 0.18 "swapped" tensor layout (`lora_A: (E*r, 2I)`, `lora_B: (H, E*r)` for gate_up_proj; `lora_A: (E*r, H)`, `lora_B: (I, E*r)` for down_proj), while PEFT 0.19+ swaps in/out features for non-transposed 3D parameters and produces `lora_A: (E*r, H)`, `lora_B: (2I, E*r)` and `lora_A: (E*r, I)`, `lora_B: (H, E*r)`. The layout mismatch hit a bare `except Exception: return W` and the dim-heuristic fallthrough in the fused helpers, so the merge silently wrote unmodified base weights and reported success. The `num_experts` value used by the per-expert loop was also taken from the shard-local key scan, which is a non-divisor of `total_rank` whenever experts are split across multiple safetensor shards (16/17 of 128 on Qwen3-30B-A3B). Finally the merged dir was missing `generation_config.json`, so chat-tuned models reloaded with default eos / sampling and ran past EOS. Changes: - `_detect_moe_lora_layout(lora_A, lora_B, num_experts, out_dim, in_dim)` classifies the layout by shape against the per-expert disk weight, so no version sniffing is required. Works on transformers 4.57.x / 5.x and peft 0.18.x / 0.19.x. - `_merge_moe_gate_or_up_expert` and `_merge_moe_down_proj_expert` branch on the detected layout. The "swapped" path is byte-identical to the previous behaviour. - `_resolve_num_experts_from_lora_stats` walks `module -> base_layer -> ...` to read the authoritative `num_experts` off the wrapped MoE module (`Qwen3MoeExperts` etc). `_merge_and_overwrite_lora` uses it to override `moe_num_experts[prefix]` after the converted-key build, so the per-expert loop never trips on a shard-local count. - `_MOE_MERGE_STATE` tracks `(attempted, applied, fallback, first_error)`. Each helper records a fallback with role / expert / shapes / reason on any unrecognised layout or exception. After the shard loop `merge_and_overwrite_lora` raises `RuntimeError` if any fallback fired, so partially-merged checkpoints can no longer be silently written. On success it prints `applied/attempted`. - The `merged_16bit` branch also calls `model.generation_config.save_pretrained(save_directory)` (best-effort, matching `fix_tokenizer_config_json`). Tests: - Existing 16 per-expert / fused / dense merge tests in `test_unsloth_zoo_lora_merge.py` still pass byte-for-byte (PEFT 0.18 swapped layout is the default branch). - 6 new tests: * standard layout for `_merge_moe_gate_expert`, `_merge_moe_up_expert`, `_merge_moe_down_proj_expert`, * layout classifier for both conventions and the unknown cases, * fallback counter increments and `first_error` populates on unrecognised shapes, * `_resolve_num_experts_from_lora_stats` walks the `base_layer` chain. End-to-end verification on Qwen3-30B-A3B (128 experts x 48 layers, fused 3D in memory, per-expert 2D on disk), full SFT + save + reload + logit compare: | transformers | peft | trl | merged tensors | trained vs merged KL | samples | |--------------|--------|--------|----------------|----------------------|---------| | 5.5.0 | 0.19.1 | 0.25.1 | 18432 / 18432 | 1.6e-5 | 3 / 3 | | 5.5.0 | 0.18.1 | 0.25.1 | 18432 / 18432 | 1.3e-5 | 3 / 3 | | 4.57.6 | 0.19.1 | 0.25.1 | dense path | 5.5e-5 | 3 / 3 | | 5.5.0 | 0.19.1 | 1.4.0 | 18432 / 18432 | 2.1e-4 | 3 / 3 | Before the patch the M1 row was KL=1.86, samples=1/3, and 0/18432 expert LoRA deltas were applied. transformers 4.57.6 has `experts = nn.ModuleList (Qwen3MoeMLP)` (no fused 3D parameter) so the MoE merge helpers do not fire and every per-expert Linear takes the standard dense `_merge_lora` path. The MoE helpers are unreachable on transformers <5; the patch only affects the path that produces the bug. Fixes unslothai/unsloth#5410. Likely also resolves unslothai/unsloth#4832 (same author, same "garbage after save_pretrained_merged reload" symptom on DevStral Small 2). * saving: bound _resolve_num_experts walk + harden against hostile attrs The base_layer walk in _resolve_num_experts_from_lora_stats was an unbounded `while module is not None` loop. PEFT's ParamWrapper does not self-reference in practice, but a self-referential or cyclic `base_layer` chain would hang the merge. Bound the walk to 16 hops, dedupe via an id() set, and swallow exceptions on getattr / getattr-of-attrs so a hostile module that raises on attribute access cannot abort the merge. Confirmed by a synthetic suite (52 cases) across three isolated venvs: peft 0.18.1 + transformers 5.5.0, peft 0.19.1 + transformers 5.5.0, peft 0.19.1 + transformers 4.57.6. All 22 existing merge tests still pass byte-for-byte in each. * saving: trim verbose comments per maintainer style Tighten the docstrings and inline comments added by the layout-aware MoE merge work so the diff is closer to the surrounding house style (see chore unslothai#640). No behaviour change; 22 / 22 merge tests still pass. --------- Co-authored-by: Daniel Han-Chen <info@unsloth.ai>

…ai#653) After zoo#647 the gate/up MoE merge entry points became thin wrappers delegating to _merge_moe_gate_or_up_expert. The drift detector test only inspected the wrapper source and so reported a false-positive regression (no .to('cuda') call, no _active_merge_device call inside the wrapper body itself). Update test_moe_expert_merges_call_active_merge_device to follow one level of delegation through sibling _merge_moe_* helpers via a small regex-walked source-union helper, so the union still requires _active_merge_device() and forbids hardcoded .to('cuda'). Local: pytest -k test_moe_expert_merges_call_active_merge_device passes (1 passed in 0.26s).

PreTrainedModel.__init__ resolves loss_type via regex on the class name, so Qwen3_5ForConditionalGeneration lands on LOSS_MAPPING["ForConditionalGeneration"]. That key (and CsmForConditionalGeneration) is aliased to the stock ForCausalLMLoss in transformers, and patch_loss_functions only ever rewrote the "ForCausalLM" entry. The result is that affected classes silently fall back to the un-patched loss, which materialises (seq_len x vocab_size) fp32 logits and OOMs on <= 24 GB GPUs at large vocab sizes. Sweep every entry whose function is currently named ForCausalLMLoss and replace it with the Unsloth kernel. Idempotent (after the first call no key reports __name__ == "ForCausalLMLoss") and leaves unrelated loss types alone. Closes unslothai/unsloth#5441

unsloth/pyproject.toml ships intel-gpu extras pinning torch-2.11.0+xpu and torch-2.12.0+xpu (added in unslothai/unsloth#5484), but unsloth_zoo's own torch constraint was capped at <2.11.0. The pip resolver consequently could not satisfy `pip install .[intel-gpu-torch2110]` or `.[intel-gpu-torch2120]` end-to-end: it either failed outright or fell back to a stale unsloth_zoo (2026.3.6) that does not enforce the cap. Relax the cap to <2.13.0 so torch 2.11.x and 2.12.x are admissible. Leaves torch>=2.4.0 floor intact and keeps the future-version cap as a safety net.

Adds an opt-in (UNSLOTH_FUSED_FORWARD=1) auto-installer that rewrites the canonical lm_head + self.loss_function triplet on every transformers `*ForCausalLM` / `*ForConditionalGeneration` whose forward matches the shape used from transformers 4.56 onwards. Skipping logits.float() over (seq_len x vocab_size) avoids the OOM that surfaced in #5441 and shaves the bf16 logits tensor as well. Layers: unsloth_zoo/fused_losses/forward_adapter.py Maps the HF self.loss_function(logits=..., labels=..., vocab_size=..., **kwargs) calling convention onto unsloth_fused_ce_loss. Pops num_items_in_batch -> n_items, threads ignore_index / label_smoothing / logit_softcapping / logit_scale_multiply / logit_scale_divide, and falls back to a stock CE if the caller passes a pre-shifted shift_labels tensor (unsupported by the chunked kernel today). unsloth_zoo/fused_losses/ast_rewriter.py NodeTransformer that recognises the canonical triplet: <NAME> = self.<HEAD>(<HIDDEN_EXPR>[...]) loss = None (optional) if labels is not None: <LOSS> = self.loss_function(<NAME>, labels, vocab_size=..., **kwargs) and rewrites it to call unsloth_fused_lm_head_loss(<HIDDEN_EXPR>, self.<HEAD>, labels, ...). Tolerates keyword vs positional vocab_size, `.float()` / `[slice]` chains around the lm_head call, and detects logits re-binding (e.g. Cohere's `logits = logits * self.logit_scale`) as a refuse signal so we never produce a broken forward. unsloth_zoo/fused_losses/forward_install.py Two-tier installer: (1) hash-allowlist fast path via register_canonical(hash, forward_fn); (2) AST triplet rewrite. Driven by a meta-path import hook that intercepts transformers.models.<X>.modeling_<X> imports and patches eligible classes as their module loads. Soft floor at transformers >= 4.56. audit() returns a JSON-safe dict of patched / unmatched / failed classes for observability. Kernel updates: unsloth_zoo/fused_losses/cross_entropy_loss.py compute_fused_ce_loss + UnslothFusedLoss.forward now thread ignore_index (default -100) into the label-shift step and the inner F.cross_entropy call. compute_fused_ce_loss also accepts label_smoothing. Matches HF ForCausalLMLoss semantics so callers that override either no longer silently regress. Tests (tests/test_fused_forward_install.py, 14 cases): - AST rewriter accepts keyword form, positional vocab_size, `.float()` wrapper. Declines non-canonical, declines on logits rebinding. - install_for_class: noop when disabled, skips ineligible names, patches canonical, idempotent, function-override fast path, audit() snapshot. - Numerical equivalence on a toy CUDA model: fused loss within bf16 -> fp32 rounding noise of the reference. - Kernel respects ignore_index and label_smoothing kwargs. End-to-end equivalence on Llama-3.2-1B + alpaca-cleaned (seed 3407, max_steps 10): identical step-1 loss + grad_norm, max |loss delta| = 0.005, max |grad_norm delta| = 0.025 across the run. Audit reported 19 classes patched, 0 failed when UNSLOTH_FUSED_FORWARD=1 (LlamaForCausalLM, Qwen3ForCausalLM, MistralForCausalLM, Gemma2/3 / GemmaForCausalLM, Mllama, DeepseekV3, Qwen3MoE / Qwen3Next, Bloom, FalconH1, etc.). Off by default. Set UNSLOTH_FUSED_FORWARD=1 to opt in.

Forwards routed through unsloth_compiled_cache see __globals__ for the cached module, which does not always re-import the HF output dataclass the original modeling file referenced (e.g. Gemma3ForCausalLM's return statement uses CausalLMOutputWithPast). Populate the exec namespace with everything from transformers.modeling_outputs as a fallback so the rewritten forward links cleanly. Caught during multi-model equivalence run (Gemma3-1B fused) which now matches the stock path bit-for-bit alongside Llama, Qwen3, Phi3, and Mistral.

forward_adapter.py - shift_labels fallback now uses reduction=sum and divides by n_items when num_items_in_batch is supplied, matching HF ForCausalLMLoss gradient-accumulation scaling. - shift_labels=False (bool) now routes to the same stock-CE fallback instead of leaking through to the always-shifting fused kernel. - Removed redundant inner import torch. cross_entropy_loss.py - Promote a non-tensor n_items divisor (HF trainers pass a Python int via gradient accumulation) to a scalar tensor before the existing DataParallel .numel()/.ravel() guard, which is preserved verbatim. Without the promotion an int n_items raises AttributeError inside the autograd forward. ast_rewriter.py - Capture the full lm_head RHS (e.g. .float()/.contiguous()/[slice]) and emit it in the else-branch so the inference path keeps its original dtype/shape semantics. - Only strip docstring-only decorators (auto_docstring, add_start_docstrings*, add_end_docstrings, replace_return_docstrings); @can_return_tuple carries return_dict=False semantics and stays. - Reject forwards with non-empty else, multi-statement labels branches, or aliased labels arguments (CSM-style depth-decoder loss survives intact rather than being silently dropped). - Reject forwards where any statement between the lm_head assign and the labels-if mutates or reads logits (Gemma3 final_logit_softcapping used to be silently bypassed by the fused-loss path). - Forward explicit loss_function keywords beyond vocab_size (Bloom passes num_items_in_batch=kwargs.get(...) without a **kwargs unpack). - _find_loss_function_call / _find_loss_assign_target now inspect only the direct if-body, so a nested guard inside the labels branch is not silently dropped. forward_install.py - Drop *ForConditionalGeneration from auto-install eligibility (the fused kernel hardcodes a causal shift; aligned-label seq2seq losses would be off-by-one). - Skip composite/non-linear heads via a _LINEAR_HEAD_ATTRS allowlist so BigBird-style self.cls(...) (BigBirdOnlyMLMHead) is not patched. - install_for_class / install_for_module now also gate on the transformers version floor, matching install_modeling_import_hook. - Inject transformers.utils.generic.can_return_tuple into the exec namespace so the preserved decorator resolves at runtime.

Compress narrative docstrings and inline rationale blocks across fused_losses/* and the __init__.py opt-in stanza. Load-bearing notes (@can_return_tuple semantics, Gemma3 softcap reasoning, BigBird composite-head guard, transformers >= 4.56 floor, ForCausalLM-only eligibility) are preserved; only WHAT-restating prose was removed.

Cover the eight semantic fixes that landed in commit db90fa1 so regressions are caught at test time rather than at training time: - test_ast_rewriter_declines_when_intermediate_touches_logits Gemma final_logit_softcapping between lm_head and the labels-if must not be silently bypassed. - test_ast_rewriter_declines_when_labels_aliased CSM-style `loss = self.loss_function(..., labels=backbone_labels)` on an `if labels is not None:` gate must refuse. - test_ast_rewriter_declines_non_trivial_labels_branch MoE-style auxiliary loss inside the labels branch must refuse so aux_loss + router_aux_loss_coef stays intact. - test_ast_rewriter_forwards_explicit_extra_kwargs Bloom-style `num_items_in_batch=kwargs.get(...)` without **kwargs must reach the kernel. - test_install_skips_for_conditional_generation *ForConditionalGeneration uses aligned labels; auto-install must skip. - test_install_skips_composite_head BigBird-style `self.cls(...)` composite head must hit the _LINEAR_HEAD_ATTRS allowlist and log as non-linear-head. - test_fused_kernel_accepts_int_n_items HF Trainer grad-accum passes a Python int divisor; kernel must promote it to a scalar tensor before the DataParallel guard. - test_adapter_falls_back_when_shift_labels_false `shift_labels=False` bool must route through stock CE; the fused kernel always re-shifts. All 22 tests pass (14 original + 8 new). Multi-model end-to-end equivalence rerun against the post-review tree (seed 3407, max_steps=10, alpaca-cleaned): model s1 eq max|loss d| max|grad d| n_patched Llama-3.2-1B True 0.00450 0.01276 11 Qwen3-0.6B True 0.00490 0.07686 11 Gemma-3-1B True 0.00000 0.00000 11 Mistral-7B-v.3 True 0.00370 0.03093 11 Step 1 loss + grad_norm are bitwise identical for every model; n_patched dropped from 19 -> 11 because ConditionalGeneration + Gemma2/3 (logits touched by softcap) + BigBird (composite head) are now correctly skipped.

is_enabled() now returns True unless UNSLOTH_FUSED_FORWARD is explicitly set to "0". Updated docstrings and the __init__.py comment to reflect the new default. The two-tier installer + LOSS_MAPPING backstop in unslothai#656 means the worst case for any class we touch is no-op (refused via _UNMATCHED or composite-head guard) -- never a worse forward than the stock path. Test suite (23 cases, was 22 + new test_install_default_is_on): all green. Refresh of the multi-model equivalence rerun with no env var set versus UNSLOTH_FUSED_FORWARD=0 (Llama-3.2-1B / Qwen3-0.6B / Gemma-3-1B / Mistral-7B-v0.3, seed 3407, max_steps=10, alpaca-cleaned): model off enabled default enabled s1 eq max|loss d| max|grad d| Llama-3.2-1B False True True 0.00410 0.02336 Qwen3-0.6B False True True 0.00680 0.02561 Gemma-3-1B False True True 0.00000 0.00000 Mistral-7B-v.3 False True True 0.00530 0.05310 Step 1 loss + grad_norm bitwise identical for every model; deltas across the run stay within bf16 -> fp32 chunked-CE rounding noise. Audit reports 11 classes patched at default and 0 patched when explicitly disabled.

trl 1.x padding_free passes shift_labels=<tensor> through the loss function. The adapter previously fell back to a materialised-logits F.cross_entropy in that case, which kept the OOM problem the chunked kernel was supposed to fix. Plumb shift_labels through unsloth_fused_ce_loss instead. The outer UnslothFusedLoss.forward already handles label shifting; when the caller pre-shifted we just flatten and skip the inner re-shift. Files: - cross_entropy_loss.py: unsloth_fused_ce_loss gains shift_labels arg (default True). Outer adds an else branch that flattens pre-shifted labels so chunking aligns with hidden_states. The four inner accumulate_chunk call sites pass False unconditionally now since the outer always normalises labels. - forward_adapter.py: drop the F.cross_entropy fallback. Pick (target, do_shift) based on the shift_labels kwarg and call the fused kernel with shift_labels=do_shift. - test_fused_forward_install.py: rename the stale fallback test and add five fp32-strict numerical checks (atol/rtol=1e-5): * auto-shift matches F.cross_entropy * pre-shifted tensor matches F.cross_entropy * shift_labels=False matches F.cross_entropy * num_items_in_batch divides correctly * int and 0-d tensor n_items produce equal loss Empirical end-to-end checks (10 step Llama-3.2-1B LoRA, max_steps=10): trl 1.4.0 padding_free=True, fused vs off: step 1 loss: 1.45730 == 1.45730 (exact) max delta over 10 steps: 0.003 (bf16 noise) num_items_in_batch wiring (batch=2, grad_accum=4): HF passes a scalar tensor, consistent across the 4 micro-batches in each window. n_items equals sum(non_ignore_labels) - rows in every window (the per-row causal-shift drop), matching the post-shift count HF uses for the mean reduction. 27/27 unit tests pass.

The fused-forward installer (forward_install.py) rewrites *ForCausalLM.forward at import time. Two upstream-pattern tests used inspect.getsource(cls.forward) and got the rewritten body, which no longer contains the canonical upstream lines compiler.py pins. Switch both probes to read the modeling module's on-disk source via __file__ instead. That is the source compiler.py's rewriter actually operates on, and it stays pristine regardless of any runtime patches. Tests affected: - test_compiler_cross_entropy_lm_head_pattern_present - test_compiler_cross_entropy_find_2_loss_function_signature

…unslothai#655)

* Honor UNSLOTH_RETURN_HIDDEN_STATES / UNSLOTH_RETURN_LOGITS in fused forward The AST-rewritten forward installed by PR unslothai#657 only had two branches: labels-not-None (fused CE, EMPTY_LOGITS) and else (real logits, no loss). It silently ignored both env vars that the compiler-rewritten forward in unsloth_zoo/compiler.py honors. For GRPO the compiled forward overrides the AST one so this never mattered in practice, but it left the AST forward behaviourally different from the compiled one and not safe to rely on standalone. Expand the rewrite template to the same three-branch shape as the compiled forward: 1. UNSLOTH_RETURN_HIDDEN_STATES=1 -> hidden_states in the logits slot, no lm_head matmul, no loss. GRPO's hidden-states fast path. 2. labels is not None -> fused CE for loss; logits = EMPTY_LOGITS unless UNSLOTH_RETURN_LOGITS=1, in which case the original lm_head expression runs so callers can train + collect logits in one forward. 3. otherwise -> original RHS verbatim, loss = None. forward_install.py: seed the rewritten forward's globals with os so the env-var reads work on classes whose original forward did not import os. Tests: ordering assertion on the rewriter output plus four CUDA-gated behaviour tests covering each branch and the priority of return-hidden over return-logits when both are set. * Drop UNSLOTH_RETURN_HIDDEN_STATES handling from AST forward The hidden-states fast path is owned by the compiler-rewritten forward in unsloth_zoo/compiler.py, which already overrides the AST forward for every *ForCausalLM class that GRPO actually runs on. Honoring the env var in the AST forward as well was defence-in-depth that nobody hits. Keep the UNSLOTH_RETURN_LOGITS opt-in (closes a real gap: lets callers collect real logits + train via fused CE in one forward). Template now goes back to two top-level branches with a nested if for the logits opt-in: if labels is not None: <fused CE> if UNSLOTH_RETURN_LOGITS == '1': logits = <original RHS> else: logits = EMPTY_LOGITS else: logits = <original RHS> loss = None Tests trimmed to match (29 passed). The ns.setdefault('os', os) seed in forward_install.py stays -- the UNSLOTH_RETURN_LOGITS read still needs os available in the rewritten forward's globals. * Avoid double lm_head matmul on UNSLOTH_RETURN_LOGITS=1 path Previous shape called both unsloth_fused_lm_head_loss (which chunks the lm_head matmul internally to compute CE) and self.<head>(<hidden>) (the full matmul) when the opt-in env var was set. Two matmuls for one materialised tensor. New shape splits the labels branch into two paths and picks the right loss path for each: if labels is not None: if UNSLOTH_RETURN_LOGITS == '1': logits = <original RHS> # one matmul loss = self.loss_function(logits, labels, # same logits vocab_size=V, ...) else: loss = unsloth_fused_lm_head_loss(...) # chunked, fused logits = EMPTY_LOGITS else: logits = <original RHS> loss = None The opt-in path now routes through the model's own self.loss_function on the already-materialised logits. Matches HF's standard CausalLM loss shape and the conditional in unsloth_zoo/compiler.py:2074. Tests assert single-matmul + single-self.loss_function on the opt-in path; numerical equivalence holds bit-identically on the toy in this sim (5.003798 vs 5.003798).

…nslothai#664) Today unsloth_zoo/__init__.py cross-syncs HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE so setting either one implies the other. The datasets library has its own equivalent env var, HF_DATASETS_OFFLINE, which is NOT in the sync. Result: when a user (or a downstream caller like the studio inference worker) sets HF_HUB_OFFLINE=1 because DNS to huggingface.co is dead, load_dataset() still issues a network call for dataset metadata and raises ConnectionError instead of resolving from cache. Extending the cross-sync to all three so any single var implies all three. Existing behaviour (only literal "1" triggers the sync) is preserved. Tests: 5 cases under tests/test_offline_env_cross_sync.py covering each var as the trigger, plus the unset case and the non-"1" value.

Two distinct Studio export bugs, one shared theme: the MLX/Studio export path had Unsloth-specific assumptions implemented in one branch (CUDA) but not centralised enough for every caller. Bug 1 — MLX/Studio q2_k_l preset crashes q2_k_l is an Unsloth-side preset (q2_k + --output-tensor-type q8_0 --token-embedding-type q8_0), not a native llama.cpp ftype. The CUDA save path in unsloth/save.py:_quantize_q2_k_l already knew to expand it. The MLX path in unsloth_zoo/mlx/utils.py:save_pretrained_gguf forwarded the preset string straight to llama-quantize, which aborted with main: invalid ftype 'q2_k_l' on Apple Silicon. Move the expansion into quantize_gguf so every caller shares one source of truth. CUDA continues to route through its own _quantize_q2_k_l first; deduplication is deferred to a follow-up. The preset branch is preset-only: it triggers only when quant_type.strip().lower() == 'q2_k_l'. Every other ftype, including the real llama.cpp q3_k_l, takes the original code path byte-for-byte unchanged. Windows shell=True + _quote path-handling preserved. Bug 2 — Patcher drift on the new conversion/ package layout Upstream llama.cpp split convert_hf_to_gguf.py into a thin entrypoint plus a conversion/ package (conversion/base.py, conversion/qwen.py). Three Unsloth regex patches in _download_convert_hf_to_gguf still target the old monolith, producing three confusing warnings and silent breakage: - "No supported architectures (TEXT or VISION) could be determined": ModelBase._model_classes is empty because per-arch modules register lazily via load_all_models(). - "Metadata branding patch target not found": the self.metadata = gguf.Metadata.load(...) line moved to conversion/base.py, so GGUFs produced by Unsloth no longer carry quantized_by='Unsloth', repo_url, and tags. - "Qwen2MoE num_experts patch target not found": upstream already handles both keys via self.find_hparam(['num_local_experts', 'num_experts']) — the legacy patch is obsolete. Patcher is now layout-aware (_detect_converter_layout). On the new package layout: - Supported-architectures are AST-extracted from conversion/__init__.py:TEXT_MODEL_MAP / MMPROJ_MODEL_MAP (the static dicts), avoiding the cost and side-effects of importing every per-arch module just to read its registration. - Metadata branding is patched IN PLACE on conversion/base.py with an idempotency marker (# UNSLOTH_BRANDING_APPLIED), so the GGUFs regain Unsloth branding. - The obsolete Qwen patch is intentionally skipped with an info log when conversion/qwen.py already searches both expert keys. Old monolithic layout: every existing patch runs unchanged. Cache-key extension: _download_convert_hf_to_gguf now folds conversion/{__init__,base,qwen}.py mtimes+sizes into the @lru_cache key so a re-pulled llama.cpp checkout re-runs the patcher. Tests (24 added, all green locally on Python 3.13): tests/test_quantize_gguf_q2_k_l.py: - q2_k_l expands to q2_k + both q8_0 tensor-type flags - case-insensitive (Q2_K_L / q2_K_L / padded) - 17 other ftypes (q2_k, q3_k_s/m/l, q4_*, q5_*, q6_k, q8_0, bf16, f16, f32) untouched — Linux/Windows non-regression gate - print_output logs both the request and the expansion - error messages keep the preset name (q2_k_l), not the rewrite tests/test_convert_hf_to_gguf_patcher.py: - synthetic package + monolith layouts: detection, AST extraction, branding patch idempotency, lines around target preserved, Qwen alias detection, cache-key invalidation - live upstream raw.githubusercontent.com fetch of current llama.cpp master: layout=package, branding patch applies cleanly, Qwen aliases already handled, TEXT_MODEL_MAP contains LlamaForCausalLM + Qwen* CI: new .github/workflows/studio-export-fix-ci.yml runs the unit tests on ubuntu-latest + macos-14 + windows-latest, max-parallel: 3, with paths: filters so unrelated pushes don't re-fire. Pure-Python (no torch needed), each job finishes in well under 2 minutes.

danielhanchen · 2026-05-19T08:00:52Z

Throwaway dry-run, validated on Linux / macOS / Windows runners. Real fix is at unslothai#667. Closing.

Manan17 and others added 30 commits February 10, 2026 05:48

Delete unsloth_compiled_cache/moe_utils.py

20ea262

updading git ignore

40a85c4

adding compatibility tests file

a61cb84

Adding saving, exporting support

d6b2624

cleanup for MLX

f0029b6

shifting the save and merge model funcitons

901c868

Make GPU deps conditional - skip torch/triton/trl/peft on macOS arm64

dfade5d

Use shared transformers version across platforms

8530d72

channging the transformers versions

531f085

Clean up dead code and experimental code in mlx cce

7e46e8a

Merge pull request #1 from Manan17/pr/cleancce

b56285b

Clean up dead code and experimental code in mlx cce

Initial mlx-vlm support

a708dc4

After fixes

a65a7a6

add review fixes

11e1b36

fix bugs

9067d08

accumulate loss in 32 bit

38d394f

add check for fullly masked prompt after train on responses

26f1fc6

Add step callback to MLXTrainer for real-time progress reporting

4f7fbfa

mlx with studio

44d4e68

fixing vision training

706822f

fixing the variable size images for training

5732c0c

fixing dtype issues

67618d8

send extra kwargs

8aba702

adding chat

78b741a

danielhanchen and others added 25 commits May 15, 2026 01:48

additional import try except handling for mlx (unslothai#654)

7b90fec

Versioning -> 2026.5.2

8f9d692

Scrub .github/workflows for staging push (matches staging base)

bb3242e

Sync .github/workflows with upstream author branch

4ee2aa2

tests: CPU regression detectors for the MoE merge / save path (#5410) (…

adc76ae

…unslothai#655)

Fix GGUF converter sibling imports (unslothai#661)

d50b31a

fix embedding matrix size mismatch bug (unslothai#645)

fc983dd

ci: add tqdm to slim install (imported by llama_cpp.py at module level)

0461ae1

This was referenced May 18, 2026

[staging] e2e Studio q2_k_l GGUF export validation (Linux / macOS / Windows) danielhanchen/unsloth-staging-2#127

Closed

Fix Studio q2_k_l GGUF export and new llama.cpp converter package layout unslothai/unsloth-zoo#667

Merged

danielhanchen closed this May 19, 2026

danielhanchen deleted the studio-export-q2kl-fix branch May 19, 2026 08:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[staging] Fix Studio q2_k_l GGUF export + new llama.cpp converter package layout#18

[staging] Fix Studio q2_k_l GGUF export + new llama.cpp converter package layout#18
danielhanchen wants to merge 145 commits into
mainfrom
studio-export-q2kl-fix

danielhanchen commented May 18, 2026

Uh oh!

danielhanchen commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

danielhanchen commented May 18, 2026

Summary

Cross-OS safety

Test plan

Uh oh!

danielhanchen commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants