Skip to content

[staging] Fix Studio q2_k_l GGUF export + new llama.cpp converter package layout#18

Closed
danielhanchen wants to merge 145 commits into
mainfrom
studio-export-q2kl-fix
Closed

[staging] Fix Studio q2_k_l GGUF export + new llama.cpp converter package layout#18
danielhanchen wants to merge 145 commits into
mainfrom
studio-export-q2kl-fix

Conversation

@danielhanchen

Copy link
Copy Markdown
Owner

Throwaway staging PR validating the q2_k_l preset dispatch + new llama.cpp conversion/ package layout patcher on Linux, macOS, and Windows runners. Do not merge — close after CI confirms green; the real fix will go to unslothai/unsloth-zoo.

Summary

  • Bug 1: MLX/Studio export crashes with main: invalid ftype 'q2_k_l' on Apple Silicon. q2_k_l is an Unsloth-side preset (q2_k + --output-tensor-type q8_0 + --token-embedding-type q8_0). The CUDA path already knew this; the MLX path did not. Centralised the expansion inside quantize_gguf so every caller shares one source of truth.
  • Bug 2: Upstream llama.cpp split convert_hf_to_gguf.py into a thin entrypoint + conversion/ package. Three Unsloth regex patches still targeted the old monolith, producing three warnings and silent breakage: empty architecture allowlist, missing Unsloth metadata branding on produced GGUFs, and a confusing-but-obsolete Qwen2MoE warning. Patcher is now layout-aware: AST-extracts archs from conversion/__init__.py, patches conversion/base.py in place (idempotent marker), skips the obsolete Qwen patch.

Cross-OS safety

  • q2_k_l branch is preset-only — every other ftype (incl. real llama.cpp q3_k_l) takes the original code path byte-for-byte unchanged.
  • Patcher detects layout structurally — old monolithic checkouts behave identically.
  • Linux user's previously-successful Llama-3.1 → Q3_K_M export continues to work; it just regains the Unsloth branding metadata.

Test plan

24 new tests, all green locally:

  • Linux: pytest tests/test_quantize_gguf_q2_k_l.py tests/test_convert_hf_to_gguf_patcher.py -v
  • macOS-14: same
  • Windows: same

CI: .github/workflows/studio-export-fix-ci.yml — 3-OS matrix, paths: filtered, concurrency.cancel-in-progress: true, max-parallel: 3 to stay under the 5-Windows-runner cap.

Manan17 and others added 30 commits February 10, 2026 05:48
- Add "mlx" device detection in device_type.py (Darwin + arm64 + mlx.core)
- Add triton and bitsandbytes stubs for MLX (no CUDA deps needed)
- Inject stubs in __init__.py when running on Apple Silicon
- Add MLXTrainer + MLXTrainingConfig mirroring SFTTrainer API
- Add CCE loss via mx.fast.cce_loss (memory-efficient, no full logit materialization)
- Add baseline CE fallback via nn.losses.cross_entropy
- Add FastLanguageModel loader using mlx-lm (from_pretrained + get_peft_model)
- Add mlx-cce and mlx-lm as optional deps in pyproject.toml
- Guard torch.cuda.memory calls in gpt_oss.py for non-CUDA devices

Benchmarked on M4 Max across 6 models (Llama-1B/3B, Gemma-2-2B, Qwen-3B/7B):
- All 6 models PASS correctness (max loss diff < 0.04)
- Memory savings: 5-22%
- Speed: neutral to 13% faster
- MLXTrainer: gradient_checkpointing, mx.compile graph fusion, per-step
  timing, grad clipping (max_grad_norm=1.0 matching HF default)
- mlx_utils: apply/remove gradient checkpointing, quantized CCE loss fn
  with fused matmul kernels, dynamic-length batching via iterate_batches,
  proper padding mask using lengths
- Benchmark: subprocess isolation per (model, config), alternating config
  order to avoid systematic bias, streaming output, warmup exclusion
- Add standalone quantized CCE benchmark script
…ability

- Rewrite training loop to mlx-lm compiled step pattern (single compiled fn for fwd+bwd+accumulate+update)
- Add _ensure_lora_frozen() to prevent NaN from unfrozen LayerNorm in adaptive optimizers
- Add mx.stop_gradient(w) to skip weight grad for frozen LM head in LoRA
- Fix quantized biases fallback (was layer.scales, now None)
- Loss functions return (loss, ntoks) tuples for proper grad accumulation
- Add LoRA+, eval loop, streaming batches, Muon/Lion optimizers, LR scheduling
- Default compile=True, gradient_checkpointing=True, max_grad_norm=0
- Benchmark: log every step, constant LR, add seed for reproducibility
- Add unsloth_zoo/mlx_cce/ subpackage with chunked cross-entropy
  engine (pure Python + inline Metal shaders, no C++ build needed)
- Remove external mlx-cce-runtime dependency from pyproject.toml
- Add mlx and mlx-lm as platform-conditional main dependencies
  (macOS arm64 only) so pip install unsloth gets everything
- Simplify mlx_utils.py: direct import from bundled mlx_cce,
  remove lazy import machinery and has_cce_kernel()
- Simplify mlx_trainer.py: remove CCE availability check and
  fallback warning (CCE is always bundled)
Clean up dead code and experimental code in mlx cce
- Add mlx_cce/ subpackage: chunked cross-entropy loss engine (pure Python + Metal shaders)
- Add mlx_trainer.py: MLXTrainer with compile, gradient checkpointing, LoRA+ support
- Add mlx_loader.py: FastMLXModel for model loading via mlx-lm
- Add mlx_utils.py: loss functions, batching, model save/export (safetensors, GGUF, Hub)
- Add stubs/ for triton and bitsandbytes on Apple Silicon
- Update __init__.py: detect MLX and skip GPU init on macOS arm64
- Update pyproject.toml: conditional deps (skip torch/trl/peft on Mac, add mlx/mlx-lm)
- Update llama_cpp.py: make torch import optional, fix CUDA-only assumptions
VLM support for MLX training, fp32 loss accumulation, train_on_responses,
fully-masked prompt check.

Conflicts resolved:
- pyproject.toml, .gitignore, __init__.py, device_type.py, llama_cpp.py,
  temporary_patches/misc.py: kept HEAD (already updated for unslothai main)
- mlx_loader.py, mlx_trainer.py, mlx_utils.py: took feat/mlx-vlm versions
  (new VLM features)
- mlx_cce/__init__.py: took feat/mlx-vlm version (restores install_mlx_fast_cce_loss)
- mlx_cce/runtime_cce.py: took feat/mlx-vlm version (fp32 grad_weight accumulation)
danielhanchen and others added 25 commits May 15, 2026 01:48
…**kwargs) (unslothai#651)

Generalises the Pixtral-specific tolerance from unslothai#650 to all drift
detectors that use ``_assert_params_superset``. Newer transformers
wrap an expanding set of model forwards as ``(self, *args, **kwargs)``
via attention-implementation dispatch and generic decorators. Recent
casualties on transformers 4.57.6 include:

  - tests/test_temporary_patches_exhaustive.py
      ::test_csm_depth_decoder_for_causal_lm_forward_named_params
      ::test_csm_for_conditional_generation_forward_named_params

The zoo patches at those sites pass through kwargs verbatim, so the
call sites stay valid; the static param-name pin just is not
meaningful against a passthrough wrapper. Skip with a descriptive
message instead of emitting spurious DRIFT.

The original FAIL branch is preserved for the case where required
names are missing AND the function does not take **kwargs (a real
upstream removal). Strict behaviour on transformers 4.x with
introspectable signatures is unchanged.
The previous default of 5.0 essentially never fired on real
transformer training: per-element gradients in steady state are
typically 1e-3..1e-1, so |g_i| > 5.0 is extremely rare even on
spike batches, mixed-precision overflow, or RL gradient bursts.
The threshold was nominally "elementwise clipping" but in practice
a no-op, leaving the trainer with no protection in the regimes
where clipping actually matters.

1.0 still keeps the trainer on MLX's fast per-leaf
``tree_map(mx.clip, ...)`` path (no global reduction, no
cross-leaf synchronisation) while actually catching outliers, and
aligns with the universal LLM ``clip_grad_norm=1.0`` baseline used
by HuggingFace Trainer / TRL / PEFT / AutoTrain.

Empirical sanity-check on a CUDA mirror of
``tests/studio/run_real_mlx_smoke.py`` (same model / LoRA / data /
optimizer; gradient clipping reproduced by ``clip_grad_value_``):

  norm_grad_norm=1.0  (control)   losses 7.64 -> 0.006  PASS
  max_grad_value=5.0  (old)       losses 7.29 -> 8.39   DIVERGED
  max_grad_value=1.0  (new)       losses 7.29 -> 3.4 (range across
                                   3 seeds 3.21..3.65; consistent
                                   plateau, no divergence)
  max_grad_value=0.5 / 0.25 / 0.1  noisier still

5.0 lets the optimizer overshoot the local basin on the 7-step
smoke; 1.0 produces a stable (if conservative) update regime
across seeds. On real long-horizon training, the difference is
almost invisible -- 1.0 fires only when needed.

Updates the dataclass default, the inline ``getattr`` fallback and
its sentinel branch in ``MLXTrainer.fit``. Behaviour preserved when
caller explicitly passes ``max_grad_value``; only the unset path
changes. The "max_grad_value wins when both set" policy is kept
for now to avoid silently slowing down callers who upgraded to
this build expecting the fast per-element path; a separate change
can revisit that.
…nslothai#647)

* saving: layout-aware MoE LoRA merge + loud-fail on fallback (#5410)

`save_pretrained_merged(save_method="merged_16bit")` silently dropped the
entire MoE expert LoRA delta on Qwen3-MoE / Qwen3.5-MoE-style models with
peft >= 0.19.1. The per-expert helpers in `saving_utils.py` hardcoded the
PEFT 0.18 "swapped" tensor layout (`lora_A: (E*r, 2I)`, `lora_B: (H, E*r)`
for gate_up_proj; `lora_A: (E*r, H)`, `lora_B: (I, E*r)` for down_proj),
while PEFT 0.19+ swaps in/out features for non-transposed 3D parameters
and produces `lora_A: (E*r, H)`, `lora_B: (2I, E*r)` and `lora_A: (E*r, I)`,
`lora_B: (H, E*r)`. The layout mismatch hit a bare `except Exception: return W`
and the dim-heuristic fallthrough in the fused helpers, so the merge
silently wrote unmodified base weights and reported success. The
`num_experts` value used by the per-expert loop was also taken from the
shard-local key scan, which is a non-divisor of `total_rank` whenever
experts are split across multiple safetensor shards (16/17 of 128 on
Qwen3-30B-A3B). Finally the merged dir was missing `generation_config.json`,
so chat-tuned models reloaded with default eos / sampling and ran past EOS.

Changes:

- `_detect_moe_lora_layout(lora_A, lora_B, num_experts, out_dim, in_dim)`
  classifies the layout by shape against the per-expert disk weight, so
  no version sniffing is required. Works on transformers 4.57.x / 5.x
  and peft 0.18.x / 0.19.x.
- `_merge_moe_gate_or_up_expert` and `_merge_moe_down_proj_expert`
  branch on the detected layout. The "swapped" path is byte-identical
  to the previous behaviour.
- `_resolve_num_experts_from_lora_stats` walks `module -> base_layer ->
  ...` to read the authoritative `num_experts` off the wrapped MoE
  module (`Qwen3MoeExperts` etc). `_merge_and_overwrite_lora` uses it
  to override `moe_num_experts[prefix]` after the converted-key build,
  so the per-expert loop never trips on a shard-local count.
- `_MOE_MERGE_STATE` tracks `(attempted, applied, fallback, first_error)`.
  Each helper records a fallback with role / expert / shapes / reason
  on any unrecognised layout or exception. After the shard loop
  `merge_and_overwrite_lora` raises `RuntimeError` if any fallback
  fired, so partially-merged checkpoints can no longer be silently
  written. On success it prints `applied/attempted`.
- The `merged_16bit` branch also calls
  `model.generation_config.save_pretrained(save_directory)` (best-effort,
  matching `fix_tokenizer_config_json`).

Tests:

- Existing 16 per-expert / fused / dense merge tests in
  `test_unsloth_zoo_lora_merge.py` still pass byte-for-byte (PEFT 0.18
  swapped layout is the default branch).
- 6 new tests:
  * standard layout for `_merge_moe_gate_expert`, `_merge_moe_up_expert`,
    `_merge_moe_down_proj_expert`,
  * layout classifier for both conventions and the unknown cases,
  * fallback counter increments and `first_error` populates on
    unrecognised shapes,
  * `_resolve_num_experts_from_lora_stats` walks the `base_layer` chain.

End-to-end verification on Qwen3-30B-A3B (128 experts x 48 layers,
fused 3D in memory, per-expert 2D on disk), full SFT + save + reload
+ logit compare:

| transformers | peft   | trl    | merged tensors | trained vs merged KL | samples |
|--------------|--------|--------|----------------|----------------------|---------|
| 5.5.0        | 0.19.1 | 0.25.1 | 18432 / 18432  | 1.6e-5               | 3 / 3   |
| 5.5.0        | 0.18.1 | 0.25.1 | 18432 / 18432  | 1.3e-5               | 3 / 3   |
| 4.57.6       | 0.19.1 | 0.25.1 | dense path     | 5.5e-5               | 3 / 3   |
| 5.5.0        | 0.19.1 | 1.4.0  | 18432 / 18432  | 2.1e-4               | 3 / 3   |

Before the patch the M1 row was KL=1.86, samples=1/3, and 0/18432 expert
LoRA deltas were applied. transformers 4.57.6 has `experts = nn.ModuleList
(Qwen3MoeMLP)` (no fused 3D parameter) so the MoE merge helpers do not
fire and every per-expert Linear takes the standard dense `_merge_lora`
path. The MoE helpers are unreachable on transformers <5; the patch only
affects the path that produces the bug.

Fixes unslothai/unsloth#5410. Likely also resolves unslothai/unsloth#4832
(same author, same "garbage after save_pretrained_merged reload" symptom
on DevStral Small 2).

* saving: bound _resolve_num_experts walk + harden against hostile attrs

The base_layer walk in _resolve_num_experts_from_lora_stats was an
unbounded `while module is not None` loop. PEFT's ParamWrapper does not
self-reference in practice, but a self-referential or cyclic `base_layer`
chain would hang the merge. Bound the walk to 16 hops, dedupe via an
id() set, and swallow exceptions on getattr / getattr-of-attrs so a
hostile module that raises on attribute access cannot abort the merge.

Confirmed by a synthetic suite (52 cases) across three isolated venvs:
peft 0.18.1 + transformers 5.5.0, peft 0.19.1 + transformers 5.5.0,
peft 0.19.1 + transformers 4.57.6. All 22 existing merge tests still
pass byte-for-byte in each.

* saving: trim verbose comments per maintainer style

Tighten the docstrings and inline comments added by the layout-aware
MoE merge work so the diff is closer to the surrounding house style
(see chore unslothai#640). No behaviour change; 22 / 22 merge tests still pass.

---------

Co-authored-by: Daniel Han-Chen <info@unsloth.ai>
…ai#653)

After zoo#647 the gate/up MoE merge entry points became thin wrappers
delegating to _merge_moe_gate_or_up_expert. The drift detector test
only inspected the wrapper source and so reported a false-positive
regression (no .to('cuda') call, no _active_merge_device call inside
the wrapper body itself).

Update test_moe_expert_merges_call_active_merge_device to follow one
level of delegation through sibling _merge_moe_* helpers via a small
regex-walked source-union helper, so the union still requires
_active_merge_device() and forbids hardcoded .to('cuda').

Local: pytest -k test_moe_expert_merges_call_active_merge_device
passes (1 passed in 0.26s).
PreTrainedModel.__init__ resolves loss_type via regex on the class name,
so Qwen3_5ForConditionalGeneration lands on
LOSS_MAPPING["ForConditionalGeneration"]. That key (and
CsmForConditionalGeneration) is aliased to the stock ForCausalLMLoss
in transformers, and patch_loss_functions only ever rewrote the
"ForCausalLM" entry. The result is that affected classes silently fall
back to the un-patched loss, which materialises (seq_len x vocab_size)
fp32 logits and OOMs on <= 24 GB GPUs at large vocab sizes.

Sweep every entry whose function is currently named ForCausalLMLoss
and replace it with the Unsloth kernel. Idempotent (after the first
call no key reports __name__ == "ForCausalLMLoss") and leaves
unrelated loss types alone.

Closes unslothai/unsloth#5441
unsloth/pyproject.toml ships intel-gpu extras pinning torch-2.11.0+xpu
and torch-2.12.0+xpu (added in unslothai/unsloth#5484), but
unsloth_zoo's own torch constraint was capped at <2.11.0. The pip
resolver consequently could not satisfy `pip install
.[intel-gpu-torch2110]` or `.[intel-gpu-torch2120]` end-to-end: it
either failed outright or fell back to a stale unsloth_zoo (2026.3.6)
that does not enforce the cap.

Relax the cap to <2.13.0 so torch 2.11.x and 2.12.x are admissible.
Leaves torch>=2.4.0 floor intact and keeps the future-version cap as a
safety net.
Adds an opt-in (UNSLOTH_FUSED_FORWARD=1) auto-installer that rewrites
the canonical lm_head + self.loss_function triplet on every transformers
`*ForCausalLM` / `*ForConditionalGeneration` whose forward matches the
shape used from transformers 4.56 onwards. Skipping logits.float() over
(seq_len x vocab_size) avoids the OOM that surfaced in #5441 and shaves
the bf16 logits tensor as well.

Layers:

  unsloth_zoo/fused_losses/forward_adapter.py
    Maps the HF self.loss_function(logits=..., labels=..., vocab_size=...,
    **kwargs) calling convention onto unsloth_fused_ce_loss. Pops
    num_items_in_batch -> n_items, threads ignore_index / label_smoothing /
    logit_softcapping / logit_scale_multiply / logit_scale_divide, and
    falls back to a stock CE if the caller passes a pre-shifted
    shift_labels tensor (unsupported by the chunked kernel today).

  unsloth_zoo/fused_losses/ast_rewriter.py
    NodeTransformer that recognises the canonical triplet:
        <NAME> = self.<HEAD>(<HIDDEN_EXPR>[...])
        loss = None              (optional)
        if labels is not None:
            <LOSS> = self.loss_function(<NAME>, labels, vocab_size=..., **kwargs)
    and rewrites it to call unsloth_fused_lm_head_loss(<HIDDEN_EXPR>,
    self.<HEAD>, labels, ...). Tolerates keyword vs positional vocab_size,
    `.float()` / `[slice]` chains around the lm_head call, and detects
    logits re-binding (e.g. Cohere's `logits = logits * self.logit_scale`)
    as a refuse signal so we never produce a broken forward.

  unsloth_zoo/fused_losses/forward_install.py
    Two-tier installer: (1) hash-allowlist fast path via
    register_canonical(hash, forward_fn); (2) AST triplet rewrite.
    Driven by a meta-path import hook that intercepts
    transformers.models.<X>.modeling_<X> imports and patches eligible
    classes as their module loads. Soft floor at transformers >= 4.56.
    audit() returns a JSON-safe dict of patched / unmatched / failed
    classes for observability.

Kernel updates:

  unsloth_zoo/fused_losses/cross_entropy_loss.py
    compute_fused_ce_loss + UnslothFusedLoss.forward now thread
    ignore_index (default -100) into the label-shift step and the inner
    F.cross_entropy call. compute_fused_ce_loss also accepts
    label_smoothing. Matches HF ForCausalLMLoss semantics so callers
    that override either no longer silently regress.

Tests (tests/test_fused_forward_install.py, 14 cases):
  - AST rewriter accepts keyword form, positional vocab_size, `.float()`
    wrapper. Declines non-canonical, declines on logits rebinding.
  - install_for_class: noop when disabled, skips ineligible names,
    patches canonical, idempotent, function-override fast path,
    audit() snapshot.
  - Numerical equivalence on a toy CUDA model: fused loss within
    bf16 -> fp32 rounding noise of the reference.
  - Kernel respects ignore_index and label_smoothing kwargs.

End-to-end equivalence on Llama-3.2-1B + alpaca-cleaned (seed 3407,
max_steps 10): identical step-1 loss + grad_norm, max |loss delta| =
0.005, max |grad_norm delta| = 0.025 across the run. Audit reported
19 classes patched, 0 failed when UNSLOTH_FUSED_FORWARD=1 (LlamaForCausalLM,
Qwen3ForCausalLM, MistralForCausalLM, Gemma2/3 / GemmaForCausalLM,
Mllama, DeepseekV3, Qwen3MoE / Qwen3Next, Bloom, FalconH1, etc.).

Off by default. Set UNSLOTH_FUSED_FORWARD=1 to opt in.
Forwards routed through unsloth_compiled_cache see __globals__ for the
cached module, which does not always re-import the HF output dataclass
the original modeling file referenced (e.g. Gemma3ForCausalLM's return
statement uses CausalLMOutputWithPast). Populate the exec namespace
with everything from transformers.modeling_outputs as a fallback so
the rewritten forward links cleanly.

Caught during multi-model equivalence run (Gemma3-1B fused) which now
matches the stock path bit-for-bit alongside Llama, Qwen3, Phi3, and
Mistral.
forward_adapter.py
- shift_labels fallback now uses reduction=sum and divides by n_items
  when num_items_in_batch is supplied, matching HF ForCausalLMLoss
  gradient-accumulation scaling.
- shift_labels=False (bool) now routes to the same stock-CE fallback
  instead of leaking through to the always-shifting fused kernel.
- Removed redundant inner import torch.

cross_entropy_loss.py
- Promote a non-tensor n_items divisor (HF trainers pass a Python int
  via gradient accumulation) to a scalar tensor before the existing
  DataParallel .numel()/.ravel() guard, which is preserved verbatim.
  Without the promotion an int n_items raises AttributeError inside
  the autograd forward.

ast_rewriter.py
- Capture the full lm_head RHS (e.g. .float()/.contiguous()/[slice])
  and emit it in the else-branch so the inference path keeps its
  original dtype/shape semantics.
- Only strip docstring-only decorators (auto_docstring,
  add_start_docstrings*, add_end_docstrings, replace_return_docstrings);
  @can_return_tuple carries return_dict=False semantics and stays.
- Reject forwards with non-empty else, multi-statement labels branches,
  or aliased labels arguments (CSM-style depth-decoder loss survives
  intact rather than being silently dropped).
- Reject forwards where any statement between the lm_head assign and
  the labels-if mutates or reads logits (Gemma3 final_logit_softcapping
  used to be silently bypassed by the fused-loss path).
- Forward explicit loss_function keywords beyond vocab_size (Bloom
  passes num_items_in_batch=kwargs.get(...) without a **kwargs unpack).
- _find_loss_function_call / _find_loss_assign_target now inspect only
  the direct if-body, so a nested guard inside the labels branch is
  not silently dropped.

forward_install.py
- Drop *ForConditionalGeneration from auto-install eligibility (the
  fused kernel hardcodes a causal shift; aligned-label seq2seq losses
  would be off-by-one).
- Skip composite/non-linear heads via a _LINEAR_HEAD_ATTRS allowlist
  so BigBird-style self.cls(...) (BigBirdOnlyMLMHead) is not patched.
- install_for_class / install_for_module now also gate on the
  transformers version floor, matching install_modeling_import_hook.
- Inject transformers.utils.generic.can_return_tuple into the exec
  namespace so the preserved decorator resolves at runtime.
Compress narrative docstrings and inline rationale blocks across
fused_losses/* and the __init__.py opt-in stanza. Load-bearing notes
(@can_return_tuple semantics, Gemma3 softcap reasoning, BigBird
composite-head guard, transformers >= 4.56 floor, ForCausalLM-only
eligibility) are preserved; only WHAT-restating prose was removed.
Cover the eight semantic fixes that landed in commit db90fa1 so
regressions are caught at test time rather than at training time:

  - test_ast_rewriter_declines_when_intermediate_touches_logits
    Gemma final_logit_softcapping between lm_head and the labels-if
    must not be silently bypassed.
  - test_ast_rewriter_declines_when_labels_aliased
    CSM-style `loss = self.loss_function(..., labels=backbone_labels)`
    on an `if labels is not None:` gate must refuse.
  - test_ast_rewriter_declines_non_trivial_labels_branch
    MoE-style auxiliary loss inside the labels branch must refuse so
    aux_loss + router_aux_loss_coef stays intact.
  - test_ast_rewriter_forwards_explicit_extra_kwargs
    Bloom-style `num_items_in_batch=kwargs.get(...)` without **kwargs
    must reach the kernel.
  - test_install_skips_for_conditional_generation
    *ForConditionalGeneration uses aligned labels; auto-install must
    skip.
  - test_install_skips_composite_head
    BigBird-style `self.cls(...)` composite head must hit the
    _LINEAR_HEAD_ATTRS allowlist and log as non-linear-head.
  - test_fused_kernel_accepts_int_n_items
    HF Trainer grad-accum passes a Python int divisor; kernel must
    promote it to a scalar tensor before the DataParallel guard.
  - test_adapter_falls_back_when_shift_labels_false
    `shift_labels=False` bool must route through stock CE; the fused
    kernel always re-shifts.

All 22 tests pass (14 original + 8 new).

Multi-model end-to-end equivalence rerun against the post-review tree
(seed 3407, max_steps=10, alpaca-cleaned):

  model            s1 eq  max|loss d|  max|grad d|  n_patched
  Llama-3.2-1B      True      0.00450      0.01276         11
  Qwen3-0.6B        True      0.00490      0.07686         11
  Gemma-3-1B        True      0.00000      0.00000         11
  Mistral-7B-v.3    True      0.00370      0.03093         11

Step 1 loss + grad_norm are bitwise identical for every model; n_patched
dropped from 19 -> 11 because ConditionalGeneration + Gemma2/3 (logits
touched by softcap) + BigBird (composite head) are now correctly skipped.
is_enabled() now returns True unless UNSLOTH_FUSED_FORWARD is explicitly
set to "0". Updated docstrings and the __init__.py comment to reflect the
new default. The two-tier installer + LOSS_MAPPING backstop in unslothai#656 means
the worst case for any class we touch is no-op (refused via _UNMATCHED
or composite-head guard) -- never a worse forward than the stock path.

Test suite (23 cases, was 22 + new test_install_default_is_on): all
green. Refresh of the multi-model equivalence rerun with no env var set
versus UNSLOTH_FUSED_FORWARD=0 (Llama-3.2-1B / Qwen3-0.6B / Gemma-3-1B /
Mistral-7B-v0.3, seed 3407, max_steps=10, alpaca-cleaned):

  model           off enabled  default enabled  s1 eq  max|loss d|  max|grad d|
  Llama-3.2-1B          False            True   True      0.00410      0.02336
  Qwen3-0.6B            False            True   True      0.00680      0.02561
  Gemma-3-1B            False            True   True      0.00000      0.00000
  Mistral-7B-v.3        False            True   True      0.00530      0.05310

Step 1 loss + grad_norm bitwise identical for every model; deltas across
the run stay within bf16 -> fp32 chunked-CE rounding noise. Audit
reports 11 classes patched at default and 0 patched when explicitly
disabled.
trl 1.x padding_free passes shift_labels=<tensor> through the loss
function. The adapter previously fell back to a materialised-logits
F.cross_entropy in that case, which kept the OOM problem the chunked
kernel was supposed to fix.

Plumb shift_labels through unsloth_fused_ce_loss instead. The outer
UnslothFusedLoss.forward already handles label shifting; when the
caller pre-shifted we just flatten and skip the inner re-shift.

Files:
- cross_entropy_loss.py: unsloth_fused_ce_loss gains shift_labels arg
  (default True). Outer adds an else branch that flattens pre-shifted
  labels so chunking aligns with hidden_states. The four inner
  accumulate_chunk call sites pass False unconditionally now since
  the outer always normalises labels.
- forward_adapter.py: drop the F.cross_entropy fallback. Pick (target,
  do_shift) based on the shift_labels kwarg and call the fused kernel
  with shift_labels=do_shift.
- test_fused_forward_install.py: rename the stale fallback test and
  add five fp32-strict numerical checks (atol/rtol=1e-5):
    * auto-shift matches F.cross_entropy
    * pre-shifted tensor matches F.cross_entropy
    * shift_labels=False matches F.cross_entropy
    * num_items_in_batch divides correctly
    * int and 0-d tensor n_items produce equal loss

Empirical end-to-end checks (10 step Llama-3.2-1B LoRA, max_steps=10):

  trl 1.4.0 padding_free=True, fused vs off:
    step 1 loss: 1.45730 == 1.45730 (exact)
    max delta over 10 steps: 0.003 (bf16 noise)

  num_items_in_batch wiring (batch=2, grad_accum=4):
    HF passes a scalar tensor, consistent across the 4 micro-batches
    in each window. n_items equals sum(non_ignore_labels) - rows in
    every window (the per-row causal-shift drop), matching the
    post-shift count HF uses for the mean reduction.

27/27 unit tests pass.
The fused-forward installer (forward_install.py) rewrites
*ForCausalLM.forward at import time. Two upstream-pattern tests used
inspect.getsource(cls.forward) and got the rewritten body, which no
longer contains the canonical upstream lines compiler.py pins.

Switch both probes to read the modeling module's on-disk source via
__file__ instead. That is the source compiler.py's rewriter actually
operates on, and it stays pristine regardless of any runtime patches.

Tests affected:
- test_compiler_cross_entropy_lm_head_pattern_present
- test_compiler_cross_entropy_find_2_loss_function_signature
* Honor UNSLOTH_RETURN_HIDDEN_STATES / UNSLOTH_RETURN_LOGITS in fused forward

The AST-rewritten forward installed by PR unslothai#657 only had two branches:
labels-not-None (fused CE, EMPTY_LOGITS) and else (real logits, no loss).
It silently ignored both env vars that the compiler-rewritten forward in
unsloth_zoo/compiler.py honors. For GRPO the compiled forward overrides
the AST one so this never mattered in practice, but it left the AST
forward behaviourally different from the compiled one and not safe to
rely on standalone.

Expand the rewrite template to the same three-branch shape as the
compiled forward:

  1. UNSLOTH_RETURN_HIDDEN_STATES=1 -> hidden_states in the logits slot,
     no lm_head matmul, no loss. GRPO's hidden-states fast path.
  2. labels is not None -> fused CE for loss; logits = EMPTY_LOGITS
     unless UNSLOTH_RETURN_LOGITS=1, in which case the original lm_head
     expression runs so callers can train + collect logits in one
     forward.
  3. otherwise -> original RHS verbatim, loss = None.

forward_install.py: seed the rewritten forward's globals with os so the
env-var reads work on classes whose original forward did not import os.

Tests: ordering assertion on the rewriter output plus four CUDA-gated
behaviour tests covering each branch and the priority of return-hidden
over return-logits when both are set.

* Drop UNSLOTH_RETURN_HIDDEN_STATES handling from AST forward

The hidden-states fast path is owned by the compiler-rewritten forward
in unsloth_zoo/compiler.py, which already overrides the AST forward for
every *ForCausalLM class that GRPO actually runs on. Honoring the env
var in the AST forward as well was defence-in-depth that nobody hits.

Keep the UNSLOTH_RETURN_LOGITS opt-in (closes a real gap: lets callers
collect real logits + train via fused CE in one forward).

Template now goes back to two top-level branches with a nested if for
the logits opt-in:

  if labels is not None:
      <fused CE>
      if UNSLOTH_RETURN_LOGITS == '1':
          logits = <original RHS>
      else:
          logits = EMPTY_LOGITS
  else:
      logits = <original RHS>
      loss = None

Tests trimmed to match (29 passed). The ns.setdefault('os', os) seed in
forward_install.py stays -- the UNSLOTH_RETURN_LOGITS read still needs
os available in the rewritten forward's globals.

* Avoid double lm_head matmul on UNSLOTH_RETURN_LOGITS=1 path

Previous shape called both unsloth_fused_lm_head_loss (which chunks the
lm_head matmul internally to compute CE) and self.<head>(<hidden>) (the
full matmul) when the opt-in env var was set. Two matmuls for one
materialised tensor.

New shape splits the labels branch into two paths and picks the right
loss path for each:

  if labels is not None:
      if UNSLOTH_RETURN_LOGITS == '1':
          logits = <original RHS>                       # one matmul
          loss   = self.loss_function(logits, labels,   # same logits
                                      vocab_size=V, ...)
      else:
          loss   = unsloth_fused_lm_head_loss(...)      # chunked, fused
          logits = EMPTY_LOGITS
  else:
      logits = <original RHS>
      loss   = None

The opt-in path now routes through the model's own self.loss_function
on the already-materialised logits. Matches HF's standard CausalLM loss
shape and the conditional in unsloth_zoo/compiler.py:2074.

Tests assert single-matmul + single-self.loss_function on the opt-in
path; numerical equivalence holds bit-identically on the toy in this
sim (5.003798 vs 5.003798).
…nslothai#664)

Today unsloth_zoo/__init__.py cross-syncs HF_HUB_OFFLINE and
TRANSFORMERS_OFFLINE so setting either one implies the other. The
datasets library has its own equivalent env var, HF_DATASETS_OFFLINE,
which is NOT in the sync. Result: when a user (or a downstream caller
like the studio inference worker) sets HF_HUB_OFFLINE=1 because DNS
to huggingface.co is dead, load_dataset() still issues a network call
for dataset metadata and raises ConnectionError instead of resolving
from cache.

Extending the cross-sync to all three so any single var implies all
three. Existing behaviour (only literal "1" triggers the sync) is
preserved.

Tests: 5 cases under tests/test_offline_env_cross_sync.py covering
each var as the trigger, plus the unset case and the non-"1" value.
Two distinct Studio export bugs, one shared theme: the MLX/Studio export
path had Unsloth-specific assumptions implemented in one branch (CUDA) but
not centralised enough for every caller.

Bug 1 — MLX/Studio q2_k_l preset crashes
  q2_k_l is an Unsloth-side preset (q2_k + --output-tensor-type q8_0
  --token-embedding-type q8_0), not a native llama.cpp ftype. The CUDA save
  path in unsloth/save.py:_quantize_q2_k_l already knew to expand it. The
  MLX path in unsloth_zoo/mlx/utils.py:save_pretrained_gguf forwarded the
  preset string straight to llama-quantize, which aborted with
      main: invalid ftype 'q2_k_l'
  on Apple Silicon. Move the expansion into quantize_gguf so every caller
  shares one source of truth. CUDA continues to route through its own
  _quantize_q2_k_l first; deduplication is deferred to a follow-up.

  The preset branch is preset-only: it triggers only when
  quant_type.strip().lower() == 'q2_k_l'. Every other ftype, including
  the real llama.cpp q3_k_l, takes the original code path byte-for-byte
  unchanged. Windows shell=True + _quote path-handling preserved.

Bug 2 — Patcher drift on the new conversion/ package layout
  Upstream llama.cpp split convert_hf_to_gguf.py into a thin entrypoint
  plus a conversion/ package (conversion/base.py, conversion/qwen.py).
  Three Unsloth regex patches in _download_convert_hf_to_gguf still target
  the old monolith, producing three confusing warnings and silent breakage:
    - "No supported architectures (TEXT or VISION) could be determined":
      ModelBase._model_classes is empty because per-arch modules register
      lazily via load_all_models().
    - "Metadata branding patch target not found": the
      self.metadata = gguf.Metadata.load(...) line moved to
      conversion/base.py, so GGUFs produced by Unsloth no longer carry
      quantized_by='Unsloth', repo_url, and tags.
    - "Qwen2MoE num_experts patch target not found": upstream already
      handles both keys via self.find_hparam(['num_local_experts',
      'num_experts']) — the legacy patch is obsolete.

  Patcher is now layout-aware (_detect_converter_layout). On the new
  package layout:
    - Supported-architectures are AST-extracted from
      conversion/__init__.py:TEXT_MODEL_MAP / MMPROJ_MODEL_MAP (the
      static dicts), avoiding the cost and side-effects of importing
      every per-arch module just to read its registration.
    - Metadata branding is patched IN PLACE on conversion/base.py with
      an idempotency marker (# UNSLOTH_BRANDING_APPLIED), so the GGUFs
      regain Unsloth branding.
    - The obsolete Qwen patch is intentionally skipped with an info log
      when conversion/qwen.py already searches both expert keys.
  Old monolithic layout: every existing patch runs unchanged.

  Cache-key extension: _download_convert_hf_to_gguf now folds
  conversion/{__init__,base,qwen}.py mtimes+sizes into the @lru_cache key
  so a re-pulled llama.cpp checkout re-runs the patcher.

Tests (24 added, all green locally on Python 3.13):
  tests/test_quantize_gguf_q2_k_l.py:
    - q2_k_l expands to q2_k + both q8_0 tensor-type flags
    - case-insensitive (Q2_K_L / q2_K_L / padded)
    - 17 other ftypes (q2_k, q3_k_s/m/l, q4_*, q5_*, q6_k, q8_0, bf16,
      f16, f32) untouched — Linux/Windows non-regression gate
    - print_output logs both the request and the expansion
    - error messages keep the preset name (q2_k_l), not the rewrite
  tests/test_convert_hf_to_gguf_patcher.py:
    - synthetic package + monolith layouts: detection, AST extraction,
      branding patch idempotency, lines around target preserved, Qwen
      alias detection, cache-key invalidation
    - live upstream raw.githubusercontent.com fetch of current llama.cpp
      master: layout=package, branding patch applies cleanly, Qwen aliases
      already handled, TEXT_MODEL_MAP contains LlamaForCausalLM + Qwen*

CI: new .github/workflows/studio-export-fix-ci.yml runs the unit tests on
ubuntu-latest + macos-14 + windows-latest, max-parallel: 3, with paths:
filters so unrelated pushes don't re-fire. Pure-Python (no torch needed),
each job finishes in well under 2 minutes.
@danielhanchen

Copy link
Copy Markdown
Owner Author

Throwaway dry-run, validated on Linux / macOS / Windows runners. Real fix is at unslothai#667. Closing.

@danielhanchen danielhanchen deleted the studio-export-q2kl-fix branch May 19, 2026 08:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants